
January 2006 Archives

More Advancements in Perl Programming

Around Easter last year, I finished writing the second edition of Advanced Perl Programming, a task that had been four years in the making. The aim of this new edition was to reflect the way that Perl programming had changed since the first edition. Much of what Sriram wrote in the original edition was still true, but to be honest, not too much of it was useful anymore--the Perl world has changed dramatically since the original publication.

The first edition was very much about how to do things yourself; it operated at a very low level by current Perl standards. With the explosion of CPAN modules in the interim, "advanced Perl programming" now consists of plugging all of the existing components together in the right order, rather than necessarily writing the components from scratch. So the nature of the book had to change a lot.

However, CPAN is still expanding, and the Perl world continues to change; Advanced Perl Programming can never be a finished book, but only a snapshot in time. On top of all that, I've been learning more, too, and discovering more tricks to get work done smarter and faster. Even during the writing of the book, some of the best practices changed and new modules were developed.

The book is still, I believe, an excellent resource for learning how to master Perl programming, but here, if you like, I want to add to that resource. I'll try to say something about the developments that have happened in each chapter of the book.

Advanced Perl

I'm actually very happy with this chapter. The only thing I left out of the first chapter that might have been useful there is a section on tie, but Programming Perl covers that well anyway.

On the other hand, although it's not particularly advanced, one of the things I wish I'd written about in the book was best practices for creating object-oriented modules. My fellow O'Reilly author Damian Conway has already written two books about these topics, so, again, I didn't get too stressed out about having to leave those sections out. That said, the two modules I would recommend for building OO classes don't appear to get a mention in Perl Best Practices.

First, we all know it's a brilliant idea to create accessors for our data members in a class; however, it's also a pain in the neck to create them yourself. There seem to be hundreds of CPAN modules that automate the process for you, but the easiest is the Class::Accessor module. With this module, you declare which accessors you want, and it will automatically create them. As a useful bonus, it creates a default new() method for you if you don't want to write one of those, either.

Instead of:

package MyClass;

sub new { my $class = shift; bless { @_ }, $class }

sub name {
    my $self = shift;
    if (@_) { $self->{name} = shift; }
    $self->{name}
}

sub address {
    my $self = shift;
    if (@_) { $self->{address} = shift; }
    $self->{address}
}

you can now say:

package MyClass;
use base qw(Class::Accessor);

MyClass->mk_accessors(qw( name address ));

Class::Accessor also contains methods for making read-only accessors and for creating separate read and write accessors, and everything is nicely overridable. Additionally, there are subclasses that extend Class::Accessor in various ways: Class::Accessor::Fast trades off a bit of the extensibility for an extra speed boost, Class::Accessor::Chained returns the object when called with parameters, and Class::Accessor::Assert does rudimentary type checking on the parameter values. There are many, many modules on the CPAN that do this sort of thing, but this one is, in my opinion, the simplest and most flexible.
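
For instance, here's a minimal sketch of those accessor variations; the Person class and its fields are invented for illustration:

package Person;
use base qw(Class::Accessor);

# name() both gets and sets; id() is read-only and will croak if you
# try to set it; secret() is write-only.
Person->mk_accessors(qw( name ));
Person->mk_ro_accessors(qw( id ));
Person->mk_wo_accessors(qw( secret ));

package main;
my $person = Person->new({ name => "Alice", id => 7 });
$person->name("Bob");        # fine
print $person->id, "\n";     # 7
# $person->id(8);            # would croak; id is read-only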

Speaking of flexibility, one way to encourage flexibility in your modules and applications is to make them pluggable--that is, to allow other pieces of code to respond to actions that you define. Module::Pluggable is a simple but powerful little module that searches for installed modules in a given namespace. Here's an example of its use in Email::FolderType:

use Module::Pluggable 
    search_path => "Email::FolderType", 
    require     => 1, 
    sub_name    => 'matchers';

This looks for all modules underneath the Email::FolderType:: namespace, requires them, and assembles a list of their classes into the matchers method. The module later determines the type of an email folder by passing it to each of the recognizers and seeing which of them handles it, with the moral equivalent of:

sub folder_type {
    my ($self, $folder) = @_;
    for my $class ($self->matchers) {
        return $class if $class->match($folder);
    }
}

This means you don't need to know, when you're writing the code, what folder types you support; you can start off with no recognizers and add them later. If a new type of email folder comes along, the user can install a third-party module from CPAN that deals with it, and Email::FolderType requires no additional coding to add support for it.
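
To make that concrete, here's a sketch of what such a third-party recognizer could look like; the Maildir2 name and the directory test are invented for illustration, but real Email::FolderType matchers follow the same match() pattern:

package Email::FolderType::Maildir2;

# A hypothetical recognizer for an imagined maildir variant. Because it
# lives under the Email::FolderType:: namespace, Module::Pluggable will
# discover it with no changes to Email::FolderType itself.
sub match {
    my ($class, $folder) = @_;
    return -d "$folder/cur" && -d "$folder/new" && -d "$folder/queue";
}

1;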

Parsing

Perhaps the biggest change of heart I had between writing a chapter and its publication was in the parsing chapter. That chapter had very little about parsing HTML, and what it did have was not very friendly. Since then, Gisle Aas and Sean Burke's HTML::TreeBuilder and the corresponding XML::TreeBuilder have established themselves as much simpler and more flexible ways to navigate HTML and XML documents.

The basic concept in HTML::TreeBuilder is the HTML element, represented as an object of the HTML::Element class:

$a = HTML::Element->new('a', href => 'http://www.perl.com/');
$html = $a->as_HTML;

This creates a new element that is an anchor tag, with an href attribute. The HTML equivalent in $html would be <a href="http://www.perl.com/"></a>.

Now you can add some content to that tag:

$a->push_content("The Perl Homepage");

This time, the object represents <a href="http://www.perl.com/">The Perl Homepage</a>.

You can ask this element for its tag, its attributes, its content, and so on:

$tag = $a->tag;
$link = $a->attr("href");
@content = $a->content_list; # More HTML::Element nodes

Of course, when you are parsing HTML, you won't be creating those elements manually. Instead, you'll be navigating a tree of them, built out of your HTML document. The top-level module HTML::TreeBuilder does this for you:

use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new();
$tree->parse_file("index.html");

Now $tree is an HTML::Element object representing the <html> tag and all of its contents. You can extract all of the links with the extract_links() method:

for (@{ $tree->extract_links() || [] }) {
    my ($link, $element, $attr, $tag) = @$_;
    print "Found link to $link in $tag\n";
}

The real workhorse of this module, though, is the look_down() method, which helps you pull elements out of the tree by their tags or attributes. For instance, in a search engine indexer for HTML files, I have the following code:

for my $tag ($tree->look_down("_tag","meta")) {
    next unless $tag->attr("name");
    $hash{$tag->attr("name")} .= $tag->attr("content"). " ";
}

$hash{title} .= $_->as_text." " for $tree->look_down("_tag","title");

This finds all <meta> tags and puts their attributes as name-value pairs in a hash; then it puts all the text inside of <title> tags together into another hash element. Similarly, you can look for tags by attribute value, spit out sub-trees as HTML or as text, and much more, besides. For reaching into HTML text and pulling out just the bits you need, I haven't found anything better.
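
For instance, here are a couple of lines in the same vein; the "navigation" class name is invented for illustration:

# Find elements by attribute value rather than by tag name...
my @navs = $tree->look_down("class", "navigation");

# ...and spit each sub-tree back out as HTML or as plain text.
for my $nav (@navs) {
    print $nav->as_HTML, "\n";
    print $nav->as_text, "\n";
}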

On the XML side of things, XML::Twig has emerged as the usual "middle layer," when XML::Simple is too simple and XML::Parser is, well, too much like hard work.
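
For a taste of that middle layer, here is a minimal XML::Twig sketch; the feed.xml file and its item/title structure are assumptions. Handlers fire as each element finishes parsing, so even huge documents never need to fit in memory at once:

use XML::Twig;

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once for each <title> inside an <item>.
        'item/title' => sub {
            my ($twig, $title) = @_;
            print $title->text, "\n";
            $twig->purge;    # release everything parsed so far
        },
    },
);
$twig->parsefile("feed.xml");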

Templating

There's not much to say about templating, although in retrospect, I would have spent more of the pages devoted to HTML::Mason talking about the Template Toolkit instead. Not that there's anything wrong with HTML::Mason, but the world seems to be moving away from templates that include code in a specific language (say, Perl) toward separate little templating languages, like TAL and the Template Toolkit.
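
For a flavor of such a little language, here is a minimal Template Toolkit sketch; the [% ... %] directives are the Toolkit's own mini-language, and no Perl appears in the template itself:

use Template;

my $template = <<'EOF';
Hello, [% name %]!
[% FOREACH item = items -%]
  * [% item %]
[% END -%]
EOF

my $tt = Template->new();
$tt->process(\$template, { name => "world", items => [qw(one two three)] })
    or die $tt->error;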

The only thing to report is that Template Toolkit finally received a bit of attention from its maintainer a couple of months ago, but the long-awaited Template Toolkit 3 is looking as far away as, well, Perl 6.

Natural Language Processing

Who would have thought that the big news of 2005 would be that Yahoo is relevant again? Not only are they coming up with interesting new search technologies such as Y!Q, but they're releasing a lot of the guts behind what they're doing as public APIs. One of those that is particularly relevant for NLP is the Term Extraction web service.

This takes a chunk of text and pulls out the distinctive terms and phrases. Think of this as a step beyond something like Lingua::EN::Keywords, with the firepower of Yahoo behind it. To access the API, simply send an HTTP POST request to a given URL:

use LWP::UserAgent;
use XML::Twig;
my $uri  = "http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction";
my $ua   = LWP::UserAgent->new();
my $resp = $ua->post($uri, {
    appid   => "PerlYahooExtractor",
    context => <<EOF
Two Scottish towns have seen the highest increase in house prices in the
UK this year, according to new figures. 
Alexandria in West Dunbartonshire and Coatbridge in North Lanarkshire
both saw an average 35% rise in 2005. 
EOF
});
if ($resp->is_success) { 
    my $xmlt = XML::Twig->new( index => [ "Result" ]);
    $xmlt->parse($resp->content);
    for my $result (@{ $xmlt->index("Result") || []}) {
        print $result->text;
    }
}

This produces:

north lanarkshire
scottish towns
west dunbartonshire
house prices
coatbridge
dunbartonshire
alexandria

Once I had informed the London Perl Mongers of this amazing discovery, Simon Wistow immediately bundled it up into a Perl module called Lingua::EN::Keywords::Yahoo, coming soon to a CPAN mirror near you.

Unicode

The best news about Unicode over the last year is that you should not have noticed any major changes. By now, the core Unicode support in Perl just works, and most of the CPAN modules that deal with external data have been updated to work with Unicode.

If you don't see or hear anything about Unicode, that's a good thing: it means it's all working properly.
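
If you want to convince yourself, the core Encode module is the usual boundary between outside bytes and inside characters; a minimal sketch:

use Encode qw(decode encode);

my $bytes = "Andr\xc3\xa9";              # UTF-8 octets from the outside world
my $chars = decode("UTF-8", $bytes);     # now a five-character string
print length($chars), "\n";              # prints 5, not 6
my $out   = encode("UTF-8", $chars);     # octets again, ready for output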

POE

The chapter on POE was a great introduction to how POE works and some of the things that you can do with it, but it focused on using POE for networking applications and daemons. That is only half the story. Recently, a lot of interest has centered on using POE for graphical and command-line applications: Randal Schwartz picks up the RSS aggregator from the end of the chapter and integrates it with a graphical interface in "Graphical interaction with POE and Tk." Here, I want to consider command-line applications.

The Term::Visual module is a POE component for creating applications with a split-screen interface; at the bottom of the interface, you type your input, and the output appears above a status line. The module handles all of the history, status bar updates, and everything else for you. Here's an application that uses Chatbot::Eliza to provide a therapeutic session with everyone's favorite digital psychiatrist.

First, set up the chatbot and create a new Term::Visual object:

#!/usr/bin/perl -w
use POE;
use POSIX qw(strftime);
use Term::Visual;
use Chatbot::Eliza;
my $eliza = Chatbot::Eliza->new();
my $vt    = Term::Visual->new( Alias => "interface" );

Now create the window, which will have space on its status bar for a clock:

my $window_id = $vt->create_window(
   Status => { 0 => { format => "[%8.8s]", fields => ["time"] } },
   Title => "Eliza" 
);

You also need a POE::Session, which will do all the work. It will have three states; the first is the _start state, to tell Term::Visual what to do with any input it gets from the keyboard and to update the clock:

POE::Session->create
(inline_states =>
  { _start          => sub {
        $_[KERNEL]->post( interface => send_me_input => "got_term_input" );
        $_[KERNEL]->yield( "update_time" );
    },

Updating the clock is simply a matter of setting the time field declared earlier to the current time, and scheduling another update at the top of the next minute:

    update_time     => sub {
        $vt->set_status_field( $window_id,
                               time => strftime("%I:%M %p", localtime) );
        $_[KERNEL]->alarm( update_time => int(time() / 60) * 60 + 60 );
    },

Finally, you need to handle the input from the user. Do that in a separate subroutine to make things a bit clearer:

    got_term_input  => \&handle_term_input,
  }
);

$poe_kernel->run();

When Term::Visual gets a line of text from the user, it passes it to the state declared in the _start state. The code takes that text, prints it to the terminal as an echo, and then passes it through Eliza:

sub handle_term_input {
  my ($heap, $input) = @_[HEAP, ARG0];
  if ($input =~ m{^/quit}i) {
    $vt->delete_window($window_id); 
    exit;
  }

  $vt->print($window_id, "> $input");
  $vt->print($window_id, $eliza->transform($input));
}

In just a few lines of code you have a familiar interface, similar to many IRC or MUD clients, with POE hiding all of the event handling away.

Testing

Advanced Perl Programming showed how to write tests so that we all can be more sure that our code is doing what it should. How do you know your tests are doing enough? Enter Paul Johnson's Devel::Cover!

Devel::Cover makes a record of each time a Perl operation or statement is executed, and then compares this against the statements in your code. So when you're running your tests, you can see which of the code paths in your module get exercised and which don't; if you have big branches of code that never get tested, maybe you should write more tests for them!

To use it on an uninstalled module:

$ cover -delete
$ HARNESS_PERL_SWITCHES=-MDevel::Cover make test
$ cover

This will give you a textual summary of code coverage; cover -report html produces a colorized, navigable hypertext summary, useful for showing to bosses.

This ensures that your code works--or at least, that it does what your tests specify. The next step is to ensure that your code is actually of relatively decent quality. Because "quality" is a subjective metric when it comes to the art of programming, Perl folk have introduced the objective measure of "Kwalitee" instead, which may or may not have any bearing on quality.

All modules on CPAN have their Kwalitee measured as part of the CPANTS (CPAN Testing Service) website. One way to test for and increase your Kwalitee is to use the Module::Build::Kwalitee module; this copies some boilerplate tests into your distribution that ensure that you have adequate and syntactically correct documentation, that you use strict and warnings, and so on.
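
The boilerplate in question looks much like the standard POD test that many distributions already ship by hand. Here's a representative t/pod.t, an illustration rather than a literal copy of what Module::Build::Kwalitee installs:

use strict;
use warnings;
use Test::More;

# Skip gracefully when the author-side dependency isn't installed.
eval "use Test::Pod 1.00";
plan skip_all => "Test::Pod 1.00 required for testing POD" if $@;

# One test for each POD file in the distribution.
all_pod_files_ok();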

All of this ought to go a fair way to improving the Kwalitee of your code, if not its actual quality!

Inline

One of the things that has come over into Perl 5 from Perl 6 development is the concept of the Native Call Interface (NCI). This hasn't fully been developed yet, but chromatic (yes, the editor of this very site) has been working on it.

The idea is that, instead of having something like Inline or XS that creates a "buffer" between Perl and C libraries, you just call those libraries directly. At the moment, you need to compile any XS module against the library you're using. This is particularly awkward for folk on cut-down operating systems that do not ship a compiler, such as Palm OS or Windows.

The strength of NCI is that it doesn't require a compiler; instead, it uses the operating system's normal means of making calls into libraries. (Hence "Native Call.") It uses Perl's DynaLoader to find libraries, load them, and then find the address of symbols inside of the library. Then it calls a generic "thunk" function to turn the symbol's address into a call. For instance:

my $lib = P5NCI::Library->new( library => 'nci_test', package => 'NCI' );
$lib->install_function( 'double_int', 'ii' );

my $two = NCI::double_int( 1 );

These lines find the nci_test shared library and get ready to put its functions into the NCI namespace. It then installs the function double_int, which is of signature int double_int(int) (hence ii). Once this is done, you can call the function from Perl. It's not much trickier than Inline, but without the intermediate step of compilation.

NCI isn't quite there yet, and it only supports very simple function signatures. However, because of its portability, it's definitely the one to watch for Perl-C interfaces in the future.

Everything Else

The last chapter is "Fun with Perl." Now, much has happened in the world of Perl fun, but much has happened all over Perl. There were many other things I wanted to write about, as well: CPAN best practices for date/time handling and email handling, Perl 6 and Pugs, the very latest web application frameworks such as Catalyst and Jifty, and so on. But all these would fill another book--and if I ever finished that, it too would require an update like this one. So I hope this is enough for you to be getting on with!

Analyzing HTML with Perl

Routine work is all around us every day, whether we like it or not. For a teacher of computing subjects, grading assignments can be such work. Certain computing assignments aim at practicing operating skills rather than creativity, especially in elementary courses. Grading this kind of assignment is time-consuming and repetitive, if not tedious.

In a business information system course that I taught, one lesson was about writing web pages. As the course was the first computing subject for the students, we used Nvu, a WYSIWYG web page editor, rather than coding the HTML. One class assignment required writing three or more inter-linked web pages containing a list of HTML elements.

Write three or more web pages having the following:

  • Italicized text (2 points)
  • Bolded text (2 points)
  • Three different colors of text (5 points)
  • Three different sizes of text (5 points)
  • Linked graphics with border (5 points)
  • Linked graphics without border (5 points)
  • Non-linked graphics with border (3 points)
  • Non-linked graphics without border (2 points)
  • Three external links (5 points)
  • One horizontal line--not full width of page (5 points)
  • Three internal links to other pages (10 points)
  • Two tables (10 points)
  • One bulleted list (5 points)
  • One numerical list (5 points)
  • Non-default text color (5 points)
  • Non-default link color (2 points)
  • Non-default active link color (2 points)
  • Non-default visited link color (2 points)
  • Non-default background color (5 points)
  • A background image (5 points)
  • Pleasant appearance in the pages (10 points)

Beginning to grade the students' work, I found it monotonous and error-prone. Because the HTML elements could be in any of the pages, I had to jump to every page and count the HTML elements in question. I also needed to do it for each element in the requirement. While some occurrences were easy to spot in the rendered pages in a browser, others required close examination of the HTML code. For example, a student wrote a horizontal line (<hr> element) extending 98 percent of the width of the window, which was difficult to differentiate visually from a full-width horizontal line. Some other students just liked to use black and dark gray as two different colors in different parts of the pages. In addition to locating the elements, awarding and totaling marks were also error-prone.

I somewhat regretted the flexibility of the requirements. If I had fixed the file names of the pages and assigned the HTML elements to individual pages, grading would have been easier. Rather than continuing the work with regret, I wrote a Perl program to grade the assignments. The program essentially parses the web pages, awards marks according to the requirements, writes basic comments, and calculates the total score.

Processing HTML with Perl

Perl's regular expressions provide excellent text-processing capability, and there are handy modules for parsing web pages. The HTML::TreeBuilder module provides an HTML parser that builds a tree structure of the elements in a web page. It is easy to create a tree and build its content from an HTML file:

$tree = HTML::TreeBuilder->new;
$tree->parse_file($file_name);

Nodes in the tree are HTML::Element objects. There are plenty of methods with which to access and manipulate elements in the tree. When you finish using the tree, destroy it and free the memory it occupied:

$tree->delete;

The module HTML::Element represents HTML elements in tree structures created by HTML::TreeBuilder. It has a huge number of methods for accessing and manipulating the element and searching for descendants down the tree or ancestors up the tree. The method find() retrieves all descending elements with one or more specified tag names. For example:

@elements = $element->find('a', 'img');

stores all <a> and <img> elements at or under $element to the array @elements. The method look_down() is a more powerful version of find(). It selects descending elements by three kinds of criteria: exactly specifying an attribute's value or a tag name, matching an attribute's value or tag name by a regular expression, and applying a subroutine that returns true on examining desired elements. Here are some examples:

@anchors = $element->look_down('_tag' => 'a');

retrieves all <a> elements at or under $element and stores them to the array @anchors.

@colors = $element->look_down('style' => qr/color/);

selects all elements at or under $element having a style attribute value that contains color.

@largeimages = $element->look_down(
    sub {
        $_[0]->tag() eq 'img' and
        ( $_[0]->attr('width')  > 100 or
          $_[0]->attr('height') > 100 )
    }
);

locates at or under $element all images (<img> elements) with widths or heights larger than 100 pixels. Note that this code will produce a warning message on encountering an <img> element that has no width or height attribute.

You can also mix the three kinds of criteria into one invocation of look_down. The last example could also be:

@largeimages = $element->look_down(
    '_tag'   => 'img',
    'width'  => qr//,
    'height' => qr//,
    sub { $_[0]->attr('width')  > 100 or
          $_[0]->attr('height') > 100 }
);

This code also caters for any missing width or height attribute in an <img> element. The parameters 'width' => qr// and 'height' => qr// guarantee selection of only those <img> elements that have both width and height attributes. The code block then checks the attribute values when invoked.

The method look_up() looks for ancestors from an element by the same kinds of criteria of look_down().

Processing Multiple Files

These methods provide great HTML parsing capability to grade the web page assignments. The grading program first builds the tree structures from the HTML files and stores them in an array @trees:

my @trees;
foreach (@files) {
    print "  building tree for $_ ...\n" if $options{v};
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($_);
    push( @trees, $tree );
}

The subroutine doitem() iterates through the array of trees, applying a passed-in code block to look for particular HTML elements in each tree and accumulating the results of calling the code block. To provide detailed information and facilitate debugging during development, it calls the convenience subroutine printd() to display the HTML elements found, with their corresponding file name, when the verbose command-line switch (-v) is set. Essentially, the code invokes this subroutine once for each kind of element in the requirement.

sub doitem {
    my $func = shift;
    my $num  = 0;
    foreach my $i ( 0 .. $#files ) {
        my @elements = $func->( $files[$i], $trees[$i] );
        printd $files[$i], @elements;
        $num += @elements;
    }
    return $num;
}

The code block passed into doitem is a subroutine that takes two parameters of a file name and its corresponding HTML tree and returns an array of selected elements in the tree. The following code block retrieves all HTML elements in italic, including the <i> elements (for example, <i>text</i>) and elements with a font-style of italic (for example, <span STYLE="font-style: italic">text</span>).

$n = doitem sub {
    my ( $file, $tree ) = @_;
    return ( $tree->find("i"),
        $tree->look_down( "style" => qr/font-style *: *italic/ ) );
};

marking "Italicized text (2 points): "
    . ( ( $n > 0 ) ? "good. 2" : "no italic text. 0" );

Two points are available for any italic text in the pages. The marking subroutine records grading in a string. At the end of the program, examining the string helps to calculate the total points.
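
The marking subroutine itself isn't shown here; a minimal sketch consistent with how the program uses it would simply append each comment line to a running transcript string:

# Accumulate one comment-plus-score line per requirement; the
# total-score loop at the end of the program reads this back.
my $marktext = "";
sub marking {
    my $line = shift;
    print "$line\n";         # immediate feedback on the terminal
    $marktext .= "$line\n";
}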

Other requirements are marked in the same manner, though some selection code is more involved. A regular expression helps to select elements with non-default colors.

my $pattern = qr/(^|[^-])color *: *rgb\( *[0-9]*, *[0-9]*, *[0-9]*\)/;
return $tree->look_down(
    "style" => $pattern,
    sub { $_[0]->as_trimmed_text ne "" }
);

Nvu applies colors to text by the color style in the form of rgb(R,G,B) (for example, <span STYLE="color: rgb(0, 0, 255);">text</span>). The above code is slightly stricter than the italic code, as it also requires an element to contain some text. The method as_trimmed_text() of HTML::Element returns the textual content of an element with any leading and trailing spaces removed.

Nested invocations of look_down() locate linked graphics with a border. This selects any link (an <a> element) that encloses an image (an <img> element) that has a border.

return $tree->look_down(
    "_tag" => "a",
    sub {
       $_[0]->look_down( "_tag" => "img", sub { hasBorder( $_[0] ) } );
    }
);

Finding non-linked graphics is more interesting, as it involves both the look_down() and look_up() methods. It should only find images (<img> elements) that do not have a parent link (an <a> element) up the tree.

return $tree->look_down(
    "_tag" => "img",
    sub { !$_[0]->look_up( "_tag" => "a" ) and hasBorder( $_[0] ); }
);

Checking valid internal links requires passing look_down() a code block that excludes common external links by checking the href value against protocol names, and verifies the existence of the file linked in the web page.

use File::Basename;
$n = doitem sub {
    my ( $file, $tree ) = @_;
    return $tree->look_down(
        "_tag" => "a",
        "href" => qr//,
        sub {
            !( $_[0]->attr("href") =~ /^ *(http:|https:|ftp:|mailto:)/)
            and -e dirname($file) . "/" . decodeURL( $_[0]->attr("href") );
        }
    );
};
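
decodeURL is a helper not shown here; presumably it undoes the percent-encoding in href values so that the -e file test sees a real file name. A minimal sketch using URI::Escape:

use URI::Escape qw(uri_unescape);

# Turn "my%20page.html#top" back into "my page.html" for the -e test.
sub decodeURL {
    my $url = shift;
    $url =~ s/#.*$//;        # drop any fragment
    return uri_unescape($url);
}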

Nvu changes a page's text color by specifying the color components in the style of the body tag, like <body style="color: rgb(0, 0, 255);">. A regular expression matches the style pattern and retrieves the three color components. Any non-zero color component denotes a non-default text color in a page.

my $pattern = qr/(?:^|[^-])color *: *rgb\(( *[0-9]*),( *[0-9]*),( *[0-9]*)\)/;
return $tree->look_down(
    "_tag"  => "body",
    "style" => qr//,
    sub {
        $_[0]->attr("style") =~ $pattern and
        ( $1 != 0 or $2 != 0 or $3 != 0 );
    }
);

With proper use of the methods look_down(), look_up(), and as_trimmed_text(), the code can locate and mark the existence of various required elements and any broken elements (images, internal links, or background images).

Finishing Up

The final requirement of the assignment is a pleasant look for the rendered pages. Unfortunately, HTML::TreeBuilder and its related modules do not analyze and quantify the visual appearance of a web page; neither does any other module that I know of. So I award marks for the appearance myself, but I still want Perl to help in the process--the program sets a default score and comment, and allows overriding them in a flexible way. By using alternative regular expressions, I can accept the default, override the score only, or override both the score and the comment.

my $input = "";
do {
    print "$str1 [$str2]: ";
    $input = <STDIN>;
    $input =~ s/(^\s+|\s+$)//g;
} until ( $input =~ /(.*\.\s+\d+$|^\s*$|^\d+$)/ );

$input = $str2 if $input eq "";
if ( $input =~ /^\d+$/ ) {
    $n = $input;
    if ( $n == 10 ) {
        $input = "good looking, nice content. $n";
    }
    else {
        ( $input = $str2 ) =~ s/(\.\s*)\d+\s*$/$1$n/;
    }
}
marking "$str1 $input";

Finally, the code examines the marking text string containing comments and scores for each requirement to calculate the total score of the assignment. Each line in that string is in a fixed format (for example, "Italicized text (2 points): good. 0"). Again, regular expressions retrieve and accumulate the maximum and awarded points.

my ( $total, $score ) = ( 0, 0 );
while ( $marktext =~ /.*?\((\d+)\s+points\).*?\.\s+(\d+)/g )
{
    $total += $1;
    $score += $2;
}
marking "Total ($total points): $score";

Depending on the command-line switches, the program may start a browser to show the first page so that I can look at the pages' appearance. It can also optionally write the grading comments and score to a text file, which can serve as feedback for the student.

I can simply run the program in the directory containing the HTML files, or specify the set of HTML files in the command-line arguments. In the best case, I just let it grade the requirements and press Enter to accept the default marking for the appearance, and then jot down the total score and email the grading text file to the student.

Conclusion

I did not weigh the time the program saved against the effort of developing it. Regardless, the program makes the grading process more accurate and less error-prone, and it is more fun to spend the time writing a Perl program and getting familiar with useful modules.

In fact, there are many other modules that could have been used in the program to provide even more automation. Had I read Wasserman's article "Automating Windows Applications with Win32::OLE," the program could also have recorded the final score to an Excel file automatically. In addition, networking modules such as Mail::Internet, Mail::Mailer, and Mail::Folder could retrieve the assignment files from emails and send the feedback files to the students directly from the program.
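
As a sketch of that Excel automation (the workbook path and cell positions are invented, and this is Windows-only):

use Win32::OLE;

# Reuse a running Excel, or start one that quits when we're done.
my $excel = Win32::OLE->GetActiveObject("Excel.Application")
         || Win32::OLE->new("Excel.Application", "Quit");
my $book  = $excel->Workbooks->Open("C:\\grades\\bis101.xls");
my $sheet = $book->Worksheets(1);
my $score = 87;                          # the total computed above
$sheet->Cells(2, 3)->{Value} = $score;   # row and column are assumptions
$book->Save;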

What Is Perl 6

Perl 6 is the long-awaited redesign and reimplementation of the popular and venerable Perl programming language. It's not out yet--nor is there an official release date--but the design and implementations make continual progress.

Why Perl 6

Innumerable programmers, hackers, system administrators, hobbyists, and dabblers write Perl 5 quite successfully. The language doesn't have the marketing budget of large consulting companies, hardware manufacturers, or tool vendors pushing it, yet people still use it to get their jobs done.

Why argue with that success? Why redesign a language that's working for so many people and in so many domains? Sure, Perl 5 has some warts, but it does a lot of things very well.

What's Right with Perl 5

As Adam Turoff explained once, Perl has two subtle advantages: manipulexity and whipuptitude. It's very important to be able to solve the problem at hand simply and easily without languages and tools and syntax getting in your way. That's whipuptitude. Manipulexity is the ability to use simple tools and build a sufficiently complex solution to a complex problem.

Not everyone who starts learning Perl for whipuptitude needs manipulexity right away, if ever, but having a tool that supports both is amazingly useful. That's where Perl's always aimed--making the easy things easy and the hard things possible, even if you don't traditionally think of yourself as a programmer.

Many of Perl 5's other benefits fall out from this philosophy. For example, though the popular conception is that Perl 5 is mostly a procedural language, there are plenty of functional programming features available--iterators, higher-order functions, lexical closures, filters, and more. The (admittedly minimal) object system also has a surprising amount of flexibility. Several CPAN modules provide various types of encapsulation, access control, and dispatch. There are even refinements of the object system itself, exploring such techniques as prototype-based refinement, mixins, and traits.

There's more than one way to do it, and many of those ways are freely available and freely usable from the CPAN. This premier repository of Perl libraries and components contains thousands of modules, from simple packagings of common idioms to huge interfaces to graphical packages, databases, and web servers. With few exceptions, the community of CPAN contributors has solved nearly any common problem you can think of (and many uncommon ones, too).

It's difficult to say whether Perl excels as a glue language because of the CPAN or whether the CPAN has succeeded because Perl excels as a glue language, but being able to munge data between two other programs, processes, libraries, or machines is highly useful. Perl's text-processing powers have few peers. Sure, you can build the single perfect command line out of several small CLI utilities, but it's rare to do it more cleanly or concisely than with Perl.

What's Wrong with Perl 5

Perl 5 isn't perfect, though, and some of its flaws are more apparent the closer Perl 6 comes to completion.

Perhaps the biggest imperfection of Perl 5 is its internals. Though much of the design is clever, there are also places of obsolescence and interdependence, as well as optimizations that no one remembers, but no one can delete without affecting too many other parts of the system. Refactoring an eleven-plus-year-old software project that runs on seventy-odd platforms and has to retain backwards compatibility with itself on many levels is daunting, and there are few people qualified to do it. It's also exceedingly difficult to recruit new people for such a task.

Backwards compatibility in general hampers Perl 5 in other ways. Even though stability of interface and behavior is good in many ways, baking in an almost-right idea makes it difficult to sell people on the absolutely right idea later, especially if it takes years to discover what the true solution really is. For example, the long-deprecated and long-denigrated pseudohash feature was, partly, a way to improve object orientation. However, the Perl 6 approach (using opaque objects) solves the same problem without introducing the complexity and performance problems that pseudohashes did.

As another example, it's much too late to remove formats from Perl 5 without breaking backwards compatibility from Perl 1. However, using formats requires the use of global variables (or scary trickery), with all of the associated maintainability and encapsulation problems.

This points to one of the most subtle flaws of Perl 5: its single implementation is its specification. Certainly there is a growing test suite that explores Perl's behavior in known situations, but too many of these tests exist to ensure that no one accidentally breaks an obscure feature of a particular implementation that no one really thought about but someone somewhere relies on in an important piece of code. You could recreate Perl from its tests--after a fashion.

Perl 6 will likely also use its test suite as its primary specification, but as Larry Wall puts it, "We're just trying to start with the right tests this time."

Even if the Perl 5 codebase did follow a specification, its design is inelegant in many places. It's also very difficult to expand. Many good ideas that would make code easier to write and maintain are too impractical to support. It's a good prototype, but it's not code that you would want to keep if you had the option to do something different.

At the language level, there are a few inconsistencies as well. For example, why should sigils change depending on how you access internal data? (The canonical answer is "To specify the context of the access," but there are other ways to mark the same thing.) When is a block a block, and when is it a hash reference? Why does SUPER method redispatch not respect the currently dispatched class of the invocant, but only the compiled class? How can you tell the indirect object notation's method name barewords from bareword class or function names?
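
The sigil question, for instance, comes down to code like this:

my @items = (1, 2, 3);
my $first = $items[0];      # one element of @items, accessed with $
my @pair  = @items[1, 2];   # a slice of the same array, accessed with @
my $count = @items;         # the whole array in scalar context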

It can be difficult to decide whether the problem with a certain feature is in the design or the implementation. Consider the desire to replace a built-in data structure with a user-defined object. Perl 5 requires you to use tie and overload to do so. To make this work, the internals check special flags on every data structure in every opcode to see if the current item has any magical behavior. This is ugly, slow, inflexible, and difficult to understand.
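
Here's the flavor of the Perl 5 approach: a minimal tie sketch that logs every store into a hash. Every opcode that touches %h must notice this magic:

package LoggingHash;
use Tie::Hash;                       # provides Tie::StdHash
our @ISA = ("Tie::StdHash");

sub STORE {
    my ($self, $key, $value) = @_;
    warn "storing $key\n";
    $self->SUPER::STORE($key, $value);
}

package main;
tie my %h, "LoggingHash";
$h{answer} = 42;                     # warns "storing answer"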

The Perl 6 solution is to allow multi-method dispatch, which not only removes conceptual complexity (at least, MMD is easier to explain than tie) but also provides the possibility of a cleaner implementation.

Perl's flexibility sometimes makes life difficult. In particular, having multiple more-or-less equivalent ways to create objects gives people plenty of opportunity to do the clever things they need to do, but it also means that people tend to choose the easiest (or sometimes the cleverest) way to do something, not necessarily the best way. It's not Perlish to allow only one way to perform a task, but there's no reason not to provide one really good and easy way to do something while providing the proper hooks and safety outlets to customize the solution cleanly.

Also, there are plenty of language optimizations that turned out to be wrong in the long term. Many of them were conventions--from pre-existing awk, shell, Unix, and regular expression cultures--that gave early Perl a familiarity and aided its initial growth. Yet now that Perl stands on its own, they can seem counter-productive.

Redesigning Perl means asking a lot of questions. Why is the method call operator two characters (one shifted), not a single dot? Why are strictures disabled by default in programs, not one-liners? Why does dereferencing a reference take so many characters? (Perl 5 overloaded curly braces in six different ways. If you can list four, you're doing well.) Why is evaluating a non-scalar container in scalar context so much less useful than it could be?

Once you accept that backwards compatibility is standing in the way of progress and resolve to change things for the better, you have a lot of opportunities to fix design and implementation decisions that turn out to have been bad--or at least, not completely correct.

Advantages of Perl 6

In exchange for breaking backwards compatibility, at least at the language level, Perl 6 offers plenty of high-powered language concepts that Perl 5 didn't support, including:

Better Internals

The Parrot project, led by designer Chip Salzenberg and pumpking Leo Toetsch, is producing the new virtual machine for the official Perl 6 release.

Parrot is a new design and implementation not specifically tied to Perl 6. Its goal is to run almost any dynamic language efficiently. Because many of the designers have plenty of experience with the Perl 5 internals, Parrot tries to avoid the common mistakes and drawbacks there. One of the first and most important design decisions is extracting the logic of overridden container behavior from opcodes into the containers themselves. That is, where you might have a tied hash in Perl 5, all of the opcodes that deal with hashes have to check that the hash received is tied. In Parrot, each hash has a specific interface and all of the opcodes expect the PMC that they receive to implement that interface. (This is the standard "Replace conditional with polymorphism" refactoring.)

Better Object Orientation

The de facto OO technique in Perl 5 is blessing a hash and accessing the hash's members directly as attributes. This is quick and easy, but it has encapsulation, substitutability, and namespace clashing problems. Those problems all have solutions: witness several competing CPAN modules that solve them.

Perl 6 instead provides opaque objects by default, with language support for creating classes and instances and declaring class and instance attributes. It also provides multiple ways to customize class and object behavior, from instantiation to destruction. Where 95 percent of objects can happily use the defaults, the 5 percent customized classes will still work with the rest of the world.

Another compelling feature is language support for roles--this is a different way of describing and encapsulating specific behavior for objects apart from inheritance or mixins. In brief, a role encapsulates behavior that multiple classes can perform, so that a function or method signature can expect an object that does a role, rather than an object that inherits from a particular abstract base class. This has powerful effects on polymorphism and genericity. Having role support in the language and the core library will make large object-oriented systems easier to write and to maintain.

Improved Consistency

Sigils, the funny little markers at the start of variables, are invariant.

Return codes make sense, especially in exceptional cases.

Similar things look similar. Different things look different. Weird things look weird.

All blocks are closures; all closures are first-class data structures on which you can set or query properties, for example.

Rules and Grammars

One of Perl 5's most useful features is integrated regular expression support--except they're not all that regular anymore. Nearly every problem Perl 5 has in the whole (inconsistency, wrong shortcuts, difficult reusability, inflexible and impenetrable internals) shows up in the syntax and implementation of regular expressions.

Perl 6 simplifies regular expressions while adding more power, producing rules. You can reuse and combine rules to produce a grammar. If you apply a grammar to text (or, perhaps, any type of input including a recursive data structure), you receive a match tree.

That sounds quite a bit like what a parser and lexer do--so there's little surprise that Perl 6 has its own locally overridable grammar that allows you to make your own syntax changes and redefine the language when you really need to. Perl 5 supported a similar feature (source filters), but it was fragile, hard to use, and even harder to re-use in serious programs.

By making a clean break from regular expressions, the designers had the opportunity to re-examine the regex syntax. The new syntax is more consistent, making common operations easier to type and to remember, and similar features now look similar.

Perl 6 has a Perl 5 compatibility layer, if you prefer quick and dirty and familiar--but give the new syntax a try, especially for projects where quick and dirty regular expressions were intractable (more than usual, anyway).

Where is it Already?

Larry announced the Perl 6 project at OSCON in 2000. Why is it taking so long? There are several reasons.

First, Perl 5 isn't going anywhere. If anything, the rate of patches and changes to the code has increased. Cleanups from Ponie and the Phalanx project continue to improve the design and implementation, and new features from Perl 6 are making their way into Perl 5.

Second, the opportunity to do the right thing without fear of breaking backwards compatibility opened up a lot of possibilities for impressive new features. Reinventing regular expressions as rules and grammars, for example, would have been difficult while retaining the flavor and syntax of awk and Henry Spencer's original implementations. The new power and consistency makes rules well worth the reinvention.

Third, the project is still a volunteer project. Though other languages and platforms have major corporate support, only a handful of Perl 6 hackers receive any form of funding to work on the project--and none of them on a full-time basis.

If you want to write actual, working Perl 6 code, it's possible. Pugs has been able to run quite a bit of the language since last summer. It will soon connect directly to Parrot again. When that happens, watch out!

Learning More

This article is merely an overview of some of the reasons for and features of Perl 6. There are plenty of details available online in writings of the designers, the mailing lists, and the source code repositories.

Design Documents

The Perl 6 home page holds links to most of the design documents for the language. In particular, Larry's Perl 6 Apocalypses explore a subject area in depth, identifying the problem and outlining his thinking about what the solution might be. Damian Conway's Perl 6 Exegeses expand upon the idea, showing concrete examples written in actual Perl 6 code.

In the past several months, the design team has started to update the Perl 6 Synopses instead. Perl 6 pumpking Patrick Michaud keeps these fresh with the current design. The Apocalypses and Exegeses remain online as interesting historical documents that take too long to write and revise as changes occur.

Implementations

Parrot has monthly releases. The Parrot distribution includes the Parrot Grammar Engine (PGE), which is Patrick's implementation of rules and grammars, as well as several languages that target Parrot. The most complete implementation is for Tcl, though the Punie project (Perl 1 on Parrot) shows the entire suite of compiler tools.

Audrey (nee Autrijus) Tang's Pugs is an unofficial Perl 6 implementation, optimized for fun. As of this writing, it supports much of Perl 6, including junctions, multimethods, and objects. It targets multiple back-ends, including Haskell, JavaScript, Perl 5, and Parrot, and moves very quickly. Pugs is a great project in which to participate--it's very easy to get a committer bit and start writing tests and fixing bugs. It's currently the main prototype and reference implementation. Time will tell what its role is in the final release.

Ponie is a port of Perl 5 to Parrot. It's a huge refactoring project with little glory but a lot of potential usefulness. C hackers are more than welcome.

Discussion

Most development discussion takes place on several Perl 6 mailing lists:

  • perl6-language discusses Perl 6, the language and its features.
  • perl6-internals discusses the design and implementation of Parrot and various languages targeting Parrot.
  • perl6-compiler discusses PGE, Pugs, and the interaction of various components of the compiler tools.

The #perl6 IRC channel on irc.freenode.net talks about Pugs and Perl 6, while #parrot on irc.perl.org concentrates on Parrot. There is almost always someone around in #perl6 to answer questions about Pugs or Perl 6.

Planet Perl Six aggregates weblogs from several designers and developers of various related projects.

Lexing Your Data

s/(?<!SHOOTING YOURSELF IN THE )FOOT/HEAD/g

Most of us have tried at one time or another to use regular expressions to do things we shouldn't: parsing HTML, obfuscating code, washing dishes, etc. This is what the technical term "showing off" means. I've done it too:

$html =~ s{
             (<a\s(?:[^>](?!href))*href\s*)
             (&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)
             ([^>]+>)
          }
          {$1 . decode_entities($2) .  $4}gsexi;

I was strutting like a peacock when I wrote that, followed quickly by eating crow when I ran it. I never did get that working right. I'm still not sure what I was trying to do. That regular expression forced me to learn how to use HTML::TokeParser. More importantly, that was the regular expression that taught me how difficult regular expressions can be.

The Problem with Regular Expressions

Look at that regex again:

 /(<a\s(?:[^>](?!href))*href\s*)(&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)([^>]+>)/

Do you know what that matches? Exactly? Are you sure? Even if it works, how easily can you modify it? If you don't know what it was trying to do (and to be fair, don't forget it's broken), how long did you spend trying to figure it out? When's the last time a single line of code gave you such fits?

The problem, of course, is that this regular expression is trying to do far more work than a single line of code is likely to do. When faced with a regular expression like that, there are a few things I like to do.

  • Document it carefully.
  • Use the /x switch so I can expand it over several lines.
  • Possibly, encapsulate it in a subroutine.

Sometimes, though, there's a fourth option: lexing.

Lexing

When developing code, we typically take a problem and break it down into a series of smaller problems that are easier to solve. Regular expressions are code and you can break them down into a series of smaller problems that are easier to solve. One technique is to use lexing to facilitate this.

Lexing is the act of breaking data down into discrete tokens and assigning meaning to those tokens. There's a bit of fudging in that statement, but it pretty much covers the basics.

Parsing typically follows lexing to convert the tokens into something more useful. Parsing is frequently the domain of some tool that applies a well-defined grammar to the lexed tokens.

Sometimes well-defined grammars are not practical for extracting and reporting information. There might not be a grammar available for a company's ad hoc log file format. Other times, you might find it easier to process the tokens manually than to spend the time writing a grammar. Still other times, you might only care about part of the data you've lexed, not all of it. All three of these reasons apply to the problem at hand.

Parsing SQL

Recently, on Perlmonks (parse a query string), someone had some SQL to parse:

select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
  when a=b then 'c'
  else 'd'
  end tough_one
from ...
where ...

The poster needed the alias for each column from that SQL. In this case, the aliases are date, months_old, product, year, and tough_one. Of course, this was only one example. There's actually plenty of generated SQL, all with subtle variations on the column aliases, so this is not a trivial task. What's interesting about this, though, is that we don't give a fig about anything except the column aliases. The rest of the text is merely there to help us find those aliases.

Your first thought might be to parse this with SQL::Statement. As it turns out, this module does not handle CASE statements. Thus, you must figure out how to patch SQL::Statement, submit said patch, and hope it gets accepted and released in a timely fashion. (Note that SQL::Statement uses SQL::Parser, so the latter is also not an option.)

Second, many of us have worked in environments where we have problems to solve in production now, but we still have to wait three weeks to get the necessary modules installed, if we can get them approved at all.

The most important reason, though, is even if SQL::Statement could handle this problem, this would be an awfully short article if you used it instead of a lexer.

Lexing Basics

As mentioned earlier, lexing is essentially the task of analyzing data and breaking it down into a series of easy-to-use tokens. While the data may be in other forms, usually this means analyzing strings. To give a trivial example, consider the expression:

x = (3 + 2) / y

When lexed, you might get a series of tokens, such as:

my @tokens = (
  [ VAR => 'x' ],
  [ OP  => '=' ],
  [ OP  => '(' ],
  [ INT => '3' ],
  [ OP  => '+' ],
  [ INT => '2' ],
  [ OP  => ')' ],
  [ OP  => '/' ],
  [ VAR => 'y' ],
);

With a proper grammar, you could then read this series of tokens and take actions based upon their values, perhaps to build a simple language interpreter or translate this code into another programming language. Even without a grammar, you can find these tokens useful.

Identifying Tokens

The first step in building a lexer is identifying the tokens you wish to parse. Look again at the SQL.

select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
  when a=b then 'c'
    else 'd'
  end tough_one
from ...
where ...

There's nothing we care about after the from keyword. Looking at this more closely, everything we do care about is immediately prior to a comma or the from keyword. However, splitting on commas isn't enough, as there are some commas embedded in function parentheses.

The first thing to do is to identify the various things you can match with simple regular expressions.

These "things" appear to be parentheses, commas, operators, keywords, and random text. A first pass at it might look something like this:

my $lparen  = qr/\(/;
my $rparen  = qr/\)/;
my $keyword = qr/(?i:select|from|as)/; # this is all this problem needs
my $comma   = qr/,/;
my $text    = qr/(?:\w+|'\w+'|"\w+")/;
my $op      = qr{[-=+*/<>]};

The text matching is somewhat naive and you might want Regexp::Common for some of the regular expressions, but keep this simple for now.

The operators are a bit more involved. Assume that some SQL might have math statements embedded in it.

Now create the actual lexer. One way to do this is to make your own. It might look something like this:

sub lexer {
    my $sql = shift;
    return sub {
        LEXER: {
            return ['KEYWORD', $1] if $sql =~ /\G ($keyword) /gcx;
            return ['COMMA',   ''] if $sql =~ /\G ($comma)   /gcx;
            return ['OP',      $1] if $sql =~ /\G ($op)      /gcx;
            return ['PAREN',    1] if $sql =~ /\G $lparen    /gcx;
            return ['PAREN',   -1] if $sql =~ /\G $rparen    /gcx;
            return ['TEXT',    $1] if $sql =~ /\G ($text)    /gcx;
            redo LEXER             if $sql =~ /\G \s+        /gcx;
        }
    };
}

my $lexer = lexer($sql);

while (defined (my $token = $lexer->())) {
    # do something with the token
}

Without going into the details of how that works, it's fair to say that this is not the best solution. By looking at the original Perlmonks post, you'll find that you need to make two passes through the data to extract what you want. I've left the explanation as an exercise for the reader.

To make this simpler, use the HOP::Lexer module from the CPAN. This module, based on the lexer technique Mark Jason Dominus describes in his book Higher Order Perl, makes creating lexers a rather trivial task and makes them a bit more powerful than the previous example. Here's the new code:

use HOP::Lexer 'make_lexer';
my @sql   = $sql;
my $lexer = make_lexer(
    sub { shift @sql },
    [ 'KEYWORD', qr/(?i:select|from|as)/          ],
    [ 'COMMA',   qr/,/                            ],
    [ 'OP',      qr{[-=+*/]}                      ],
    [ 'PAREN',   qr/\(/,      sub { [shift,  1] } ],
    [ 'PAREN',   qr/\)/,      sub { [shift, -1] } ],
    [ 'TEXT',    qr/(?:\w+|'\w+'|"\w+")/, \&text  ],
    [ 'SPACE',   qr/\s*/,     sub {}              ],
);

sub text {
    my ($label, $value) = @_;
    $value =~ s/^["']//;
    $value =~ s/["']$//;
    return [ $label, $value ];
}

This certainly doesn't look any easier to read, but bear with me.

The make_lexer subroutine takes as its first argument an iterator, which returns the text to match on every call. In this case, you only have one snippet of text to match, so merely shift it off of an array. If you were reading lines from a log file, the iterator would be quite handy.
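
If, say, the SQL arrived one statement per line in a log file (the queries.log name is an assumption), only the iterator changes:

open my $fh, "<", "queries.log" or die "Can't open queries.log: $!";

# Same token definitions as before; the iterator now reads lines.
my $lexer = make_lexer(
    sub { scalar <$fh> },    # one line per call; undef at EOF ends the stream
    [ 'KEYWORD', qr/(?i:select|from|as)/          ],
    [ 'COMMA',   qr/,/                            ],
    [ 'OP',      qr{[-=+*/]}                      ],
    [ 'PAREN',   qr/\(/,      sub { [shift,  1] } ],
    [ 'PAREN',   qr/\)/,      sub { [shift, -1] } ],
    [ 'TEXT',    qr/(?:\w+|'\w+'|"\w+")/, \&text  ],
    [ 'SPACE',   qr/\s*/,     sub {}              ],
);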

After the first argument comes a series of array references. Each reference takes two mandatory arguments and one optional argument:

[ $label, $pattern, $optional_subroutine ]

The $label is the name of the token. The pattern should match whatever the label identifies. The third argument, a subroutine reference, takes as arguments the label and the text the label matched, and returns whatever you wish for a token.

Consider how you typically use the make_lexer subroutine.

[ 'KEYWORD', qr/(?i:select|from|as)/ ],

Here's an example of how to transform the data before making the token:

[ 'TEXT', qr/(?:\w+|'\w+'|"\w+")/, \&text  ],

As mentioned previously, the regular expression might be naive, but leave that for now and focus on the &text subroutine.

sub text {
    my ($label, $value) = @_;
    $value =~ s/^["']//;
    $value =~ s/["']$//;
    return [ $label, $value ];
}

This says, "Take the label and the value, strip leading and trailing quotes from the value and return them in an array reference."

To strip the white space you don't care about, simply return nothing:

[ 'SPACE', qr/\s*/, sub {} ],

Now that you have your lexer, put it to work. Remember that the column aliases are the TEXT tokens not in parentheses, but immediately prior to commas or the from keyword. How do you know whether you're inside parentheses? Cheat a little bit:

[ 'PAREN', qr/\(/, sub { [shift,  1] } ],
[ 'PAREN', qr/\)/, sub { [shift, -1] } ],

With that, you can add one whenever you get to an opening parenthesis and subtract one when you get to a closing one. Whenever the result is zero, you know that you're outside of parentheses.

To get the tokens, call the $lexer iterator repeatedly.

while ( defined ( my $token = $lexer->() ) ) { ... }

The tokens look like this:

[  'KEYWORD',      'select' ]
[  'TEXT',       'the_date' ]
[  'KEYWORD',          'as' ]
[  'TEXT',           'date' ]
[  'COMMA',             ',' ]
[  'TEXT',          'round' ]
[  'PAREN',               1 ]
[  'TEXT', 'months_between' ]
[  'PAREN',               1 ]

And so on.

Here's how to process the tokens:

 1:  my $inside_parens = 0;
 2:  while ( defined (my $token = $lexer->()) ) {
 3:      my ($label, $value) = @$token;
 4:      $inside_parens += $value if 'PAREN' eq $label;
 5:      next if $inside_parens || 'TEXT' ne $label;
 6:      if (defined (my $next = $lexer->('peek'))) {
 7:          my ($next_label, $next_value) = @$next;
 8:          if ('COMMA' eq $next_label) {
 9:              print "$value\n";
10:          }
11:          elsif ('KEYWORD' eq $next_label && 'from' eq $next_value) {
12:              print "$value\n";
13:              last; # we're done
14:          }
15:      }
16:  }

This is pretty straightforward, but there are some tricky bits. Each token is a two-element array reference, so line 3 makes the label and value fairly explicit. Lines 4 and 5 use the "cheat" for handling parentheses. Line 5 also skips anything that isn't text and therefore cannot be a column alias.

Line 6 is a bit odd. In HOP::Lexer, passing the string peek to the lexer will return the next token without actually advancing the $lexer iterator. From there, it's straightforward logic to find out if the value is a column alias that matches the criteria.

Putting all of this together gives:

#!/usr/bin/perl

use strict;
use warnings;
use HOP::Lexer 'make_lexer';

my $sql = <<END_SQL;
select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
  when a=b then 'c'
    else 'd'
      end tough_one
      from XXX
END_SQL

my @sql   = $sql;
my $lexer = make_lexer(
    sub { shift @sql },
    [ 'KEYWORD', qr/(?i:select|from|as)/          ],
    [ 'COMMA',   qr/,/                            ],
    [ 'OP',      qr{[-=+*/]}                      ],
    [ 'PAREN',   qr/\(/,      sub { [shift,  1] } ],
    [ 'PAREN',   qr/\)/,      sub { [shift, -1] } ],
    [ 'TEXT',    qr/(?:\w+|'\w+'|"\w+")/, \&text  ],
    [ 'SPACE',   qr/\s*/,     sub {}              ],
);

sub text {
    my ( $label, $value ) = @_;
    $value =~ s/^["']//;
    $value =~ s/["']$//;
    return [ $label, $value ];
}

my $inside_parens = 0;
while ( defined ( my $token = $lexer->() ) ) {
    my ( $label, $value ) = @$token;
    $inside_parens += $value if 'PAREN' eq $label;
    next if $inside_parens || 'TEXT' ne $label;
    if ( defined ( my $next = $lexer->('peek') ) ) {
        my ( $next_label, $next_value ) = @$next;
        if ( 'COMMA' eq $next_label ) {
            print "$value\n";
        }
        elsif ( 'KEYWORD' eq $next_label && 'from' eq $next_value ) {
            print "$value\n";
            last; # we're done
        }
    }
}

That prints out the column aliases:

date
months_old
product
year
tough_one

So are you done? No, probably not. What you really need now are many other examples of the SQL generated in the original problem. Maybe the &text subroutine is naive. Maybe there are other operators you forgot. Maybe there are floating-point numbers embedded in the SQL. When you have to lex data by hand, fine-tuning the lexer to match your actual data can take a few tries.

It's also important to note that precedence is very important here. &make_lexer evaluates each array reference passed in the order it receives them. If you passed the TEXT array reference before the KEYWORD array reference, the TEXT regular expression would match keywords before the KEYWORD could, thus generating spurious results.

Happy lexing!
