Diving into the Perl Regular Expression Engine

Date: 2009-04-29 Source: cobrawgl

Basic regex functionality and syntax using Perl

Perl is quite possibly best known for its regular expression engine, hereafter referred to as the regex engine or simply "the engine." It is an incredibly powerful tool. I will assume that the reader has a basic knowledge of how perl handles variable interpolation, loop structure, and syntax, and a general idea of what a regular expression parser does. That being said, let's move on.

We will begin with perl's powerful m//, the pattern matching operator. The m// operator may be bound to any scalar value with the =~ operator. In scalar context its return value is boolean: 1 if the pattern matches and the empty string (a false value) if not. Figure 1 shows the basic structure of the m// operator.

Fig. 1
$BOOLEAN = ($STRING_TO_MATCH =~ m/PATTERN/);

The parentheses aren't necessary, as the =~ operator has higher precedence than the = operator. So Fig. 1 could be expressed in English by saying, "If PATTERN is found anywhere within $STRING_TO_MATCH, return 1 to $BOOLEAN." This becomes extremely powerful when used in a manner such as Fig. 2:

Fig. 2
my $count = 0;
while(<INPUT>){
    if($_ =~ m/PATTERN/){
        $count++;
    }
}
print "I found $count instances of that pattern within the file.\n";
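The return value described for Fig. 1 can be checked directly. This is a minimal sketch; the sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample string, not taken from the article.
my $string = "the quick brown fox";

# =~ binds more tightly than =, so no parentheses are needed.
my $found   = $string =~ m/quick/;   # true value (1)
my $missing = $string =~ m/slow/;    # false value (empty string)

print "found 'quick'\n"       if $found;
print "did not find 'slow'\n" unless $missing;
```

Because a failed match yields a defined but empty string, printing $missing produces nothing and raises no "uninitialized" warning.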

The m// operator (and the s/// operator as you'll see later) will default to $_ if the scalar value is not specified, as anyone with any perl experience might assume. So one could express Fig. 2 more simply in the way of Fig. 3:

Fig. 3
while(<INPUT>){
    if(m/PATTERN/){
        $count++;}
}

Keep in mind, however, that the return value of the match is boolean: each line counts once, no matter how many times the pattern occurs in it. If you ran Fig. 3 with a pattern such as boo against a .txt file containing the lines in Fig. 4, you might be surprised to find that the count falls short of the actual number of occurrences. Fig. 3 would report two matches, one for each matching line, even though "boo" appears three times.

Fig. 4
boo boo be doo
boo
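The difference between counting matching lines and counting occurrences can be sketched as follows. The two lines of Fig. 4 are held in an array rather than read from a file, the pattern boo is an assumption, and the sketch borrows the /g modifier, which is covered later with s///.

```perl
use strict;
use warnings;

# The two lines of Fig. 4, held in an array instead of a file.
my @lines = ("boo boo be doo\n", "boo\n");

my ($line_count, $total_count) = (0, 0);
for (@lines) {
    $line_count++ if /boo/;    # boolean: at most one per line
    my @hits = /boo/g;         # list context with /g: every occurrence
    $total_count += @hits;
}
print "lines matched: $line_count, occurrences: $total_count\n";
```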

We can slim Fig. 3 down even further by stripping the m from m//, when using forward slashes for the operator. Therefore, /PATTERN/ is the same as saying m/PATTERN/. Later, we'll explain the use of other delimiters.

Now, let's talk about interpolation and double quote context. The m// and s/// operators interpolate variables within the delimiters just as the perl interpreter does in double quoted strings. After interpolation, the engine matches ordinary characters literally, but certain special characters are treated as metacharacters (more on that later). Therefore one should avoid using special characters within the delimiters unless one expressly intends the engine to interpret them as metacharacters. The usual backslash rules also apply within the delimiters: a backslash in front of a special character makes it literal. Therefore, one might look for a url by expressing something like Fig. 5.

Fig. 5
if(/http:\/\/www\.osix\.net/){
    print "found the osix homepage!\n";}

That is to say that /, being the delimiter here, must be preceded by a backslash if one intends the engine to treat it as part of the pattern, and ., being what we call a metacharacter, must be escaped the same way to match a literal dot. Below is the list of metacharacters of which one must remain aware. You'll notice the backslash in the list as well; yes, we would express that literally as \\.

\|()[{^$*+?.
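A short sketch of why escaping matters. The sample strings are made up, and \Q...\E (listed among the metasymbols further down) is shown as one way to escape a whole interpolated variable at once.

```perl
use strict;
use warnings;

my $text = "total: 3.50 (approx)";   # hypothetical sample text

# Unescaped, . matches any character, so "3x50" matches as well.
my $loose  = "3x50" =~ /3.50/;
# Escaped, \. matches only a literal dot.
my $strict = "3x50" =~ /3\.50/;

# \Q...\E escapes every metacharacter in the interpolated variable,
# so the parentheses are matched literally instead of forming a group.
my $literal = "(approx)";
my $quoted  = $text =~ /\Q$literal\E/;

print "loose: ",  ($loose  ? "match" : "no match"), "\n";
print "strict: ", ($strict ? "match" : "no match"), "\n";
print "quoted: ", ($quoted ? "match" : "no match"), "\n";
```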

A few of these metacharacters beg to be analyzed right now. Firstly, the . metacharacter is a wildcard that matches any character except the newline (\n). Secondly, the | metacharacter may be used as an OR within a regex. For instance,

m/(foo)|(bar)/ ## will return 1 if the engine finds "foo" or "bar" within the value.

By default the ^ and $ metacharacters are true at the beginning and end of the string, respectively ($ is also true just before a trailing newline); with the /m modifier, described later, they are true at the beginning and end of each line. The other metacharacters are reserved for other uses that we'll describe later. One might then wonder what a character preceded by a backslash might imply. When one sees an alphanumeric character that is preceded by a backslash, one should know that it becomes a metacharacter or sequence. We call these metasymbols, and we will review each of these a little later as well.

A couple of these metacharacters are used to quantify pattern matching. We can match multiples or additional instances of patterns by using the metacharacters as shown in Fig. 6. When the quantifiers are used with parentheses as grouping agents, the engine becomes particularly powerful.
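Alternation and the two anchors can be tried out on a few strings. This is only a sketch; the sample strings are invented for the purpose.

```perl
use strict;
use warnings;

my @samples = ("foo fighters", "a bar", "bazaar");   # made-up strings

my @either = grep { /foo|bar/ } @samples;   # "foo" or "bar" anywhere
my @starts = grep { /^foo/ }    @samples;   # "foo" at the very start
my @ends   = grep { /bar$/ }    @samples;   # "bar" at the very end

print "either: @either\n";
print "starts: @starts\n";
print "ends:   @ends\n";
```

"bazaar" matches none of the three patterns, since it contains neither "foo" nor "bar".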

Fig. 6
/(foo){3}/
/foo{3}/
/foo{3,}/
/foo{3,7}/
/(foo)*/
/(foo)*?/
/(foo)+/
/(foo)+?/

Line one would return true if the engine found "foofoofoo" anywhere, while line 2 would return true if "foooo" were found within the string, because without parentheses the quantifier applies only to the final o. {x} matches exactly x times, {x,} matches at least x times, and {x,y} matches at least x and no more than y times. * is true if the engine finds 0 or more instances of the pattern within the string, and + is true if it finds 1 or more. On its own, ? matches 0 or 1 instances; placed after another quantifier, it turns that quantifier into what we call a minimal quantifier. This becomes a lot more helpful when dealing with the s/// operator, but we'll review it now and implement it later. The perl engine naturally wants to eat up as much of the string as possible while looking for a match for the pattern. We'll discuss the greedy nature of the engine later.

One may choose to use another character as a delimiter, but one must then spell out the m or s explicitly. This becomes particularly handy when the pattern includes a number of forward slashes, e.g. urls, filepaths, etc.

s#/usr#/usr/bin#
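The quantifiers in Fig. 6 can be checked against a few hand-picked strings. In this sketch the strings and variable names are assumptions, and an extra outer group is added to the last two patterns purely to capture what the quantified group consumed.

```perl
use strict;
use warnings;

my $three_foos = "foofoofoo" =~ /(foo){3}/;   # true: "foo" three times
my $four_os    = "foooo"     =~ /foo{3}/;     # true: "fo" plus three more o's
my $two_os     = "foo"       =~ /foo{3}/;     # false: too few o's
my $star       = "bar"       =~ /(foo)*/;     # true: zero instances is enough
my $plus       = "bar"       =~ /(foo)+/;     # false: needs at least one

# In list context a match returns its captured groups; the first is the outer group.
my ($maximal) = "foofoo" =~ /((foo)+)/;    # greedy: takes both copies
my ($minimal) = "foofoo" =~ /((foo)+?)/;   # minimal: stops after one

print "maximal=$maximal minimal=$minimal\n";
```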

or

m#http://www\.osix\.net/index\.html#
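As a sketch of the same url match written with different delimiters: the variable names and the sample path are assumptions, and braces are shown as one more delimiter choice that perl accepts.

```perl
use strict;
use warnings;

my $url = "http://www.osix.net/index.html";

# Forward slashes as delimiters: every / in the pattern needs a backslash.
my $slashes = $url =~ /http:\/\/www\.osix\.net\/index\.html/;
# Other delimiters: the slashes can stay as they are, but the m is required.
my $hashes  = $url =~ m#http://www\.osix\.net/index\.html#;
my $braces  = $url =~ m{http://www\.osix\.net/index\.html};

# The same applies to s///.
my $path = "/usr";
$path =~ s#/usr#/usr/bin#;

print "all three match\n" if $slashes && $hashes && $braces;
print "$path\n";
```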

Let's talk modifiers, shall we? Conveniently m// and s/// use the same modifiers, but s/// uses some additional modifiers that we'll discuss later. Modifiers are placed after the closing delimiter and are piled up one after another. Let's discuss each modifier briefly one at a time. /i ignores any alphabetic case, so

m/fOoBaR/i

would match "FooBar", "foobaR", and "FoObAr" alike. /s lets . match newlines and ignores the deprecated $* variable. /m lets ^ and $ match next to \n characters. That is to say that /s and /m should be used when you expressly wish the engine to assume that the string is a single line or comprised of multiple lines, respectively. /x ignores whitespace and permits comments within the pattern. /o compiles the pattern just once when interpolating variables within the pattern. /o becomes helpful when attempting to streamline a regex that interpolates a variable that will not change value from iteration to iteration. Therefore, it might be helpful in the case of Fig. 7 but not in the case of Fig. 8:

Fig. 7
my $pattern = "foo";
while(<INPUT>){
    if(/$pattern/o){
        print "found foo!\n";}}

Fig. 8
for($counter=0;$counter<5;$counter++){
    $pattern=<STDIN>;
    chomp($pattern); ## drop the trailing newline so it isn't part of the pattern
    if($string_to_match=~/$pattern/){
        print "found foo!\n";}}

If we were to use the /o modifier in Fig. 8, only the first input pattern would be hunted through the string in each iteration of the for loop.
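The effect of /i, /s, /m, and /x can be seen side by side in a small sketch. The two-line sample text and the date string are assumptions made for the example.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";   # hypothetical two-line string

my $case_blind    = "FoObAr" =~ /foobar/i;          # true: case is ignored
my $dot_default   = $text =~ /first.*second/;       # false: . stops at \n
my $dot_s         = $text =~ /first.*second/s;      # true: /s lets . cross \n
my $caret_default = $text =~ /^second/;             # false: ^ is start of string
my $caret_m       = $text =~ /^second/m;            # true: /m lets ^ follow \n

# /x lets the pattern be spread out and commented.
my $dated = "2009-04-29" =~ /
    \d{4}   # year
    -       # separator
    \d{2}   # month
/x;

print "dot_default=", ($dot_default ? 1 : 0), " dot_s=", ($dot_s ? 1 : 0), "\n";
print "caret_default=", ($caret_default ? 1 : 0), " caret_m=", ($caret_m ? 1 : 0), "\n";
```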

Character classes are a handy tool implemented by the regex engine. A character class defines a set of characters, any one of which may match at that position. They are grouped using the [] brackets and may be used with a - range operator (.. wouldn't make a lot of sense, would it?). You can see some instances of character class definitions below in Fig. 9.

Fig. 9
m/[a-zA-Z]/ ## easy way to search for any case-insensitive alpha
m/[0-9a-z]/i ## easy way to search for any alphanumeric
m/.*?\.([0-9]{3})/ ## looking for a floating point with at least three digits to the right of the decimal.
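The classes in Fig. 9 can be exercised on a handful of strings. This is a sketch with made-up sample data; the ^ and $ anchors in the second pattern are an addition so that the whole string, not just one character, has to be alphanumeric.

```perl
use strict;
use warnings;

my @samples = ("abc", "1234", "a1", "3.14159", "!!");   # made-up strings

my @has_letter = grep { /[a-zA-Z]/ }      @samples;   # contain at least one letter
my @all_alnum  = grep { /^[0-9a-z]+$/i }  @samples;   # consist only of alphanumerics

# Capture the first three digits to the right of the decimal point.
my ($digits) = "3.14159" =~ /\.([0-9]{3})/;

print "letters: @has_letter\n";
print "alnum:   @all_alnum\n";
print "digits:  $digits\n";
```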

Let's now discuss those special metasymbols before moving on to the s/// operator. Below is a list of what we call metasymbols: the backslashed alphanumerics that act as metacharacters or sequences.

open OREILLY, "<", "O\'Reilly, Programming Perl\; Third Edition";
while(<OREILLY>){
\0 match null character
\NNN match the character given in octal, up to \377
\1, \2, ... match the nth previously captured string
\b match the backspace character (in character classes only)
\b true at word boundary
\B true when not at word boundary
\cX match the control character Control-X
\C match one byte (C char)
\d match any digit character
\D match any non-digit
\e match ESC
\E end case translation (\L, \U)
\f match form feed character
\G true at end-of-match position of prior m//g
\l lowercase the next character only
\L lowercase till \E
\n match the newline character
\N{NAME} match named character
\p{PROP} match any character with the named property
\Q de-meta metacharacters till \E
\r match return character
\s match any whitespace character
\S match any non-whitespace character
\t match tab
\u titlecase next character only
\U uppercase till \E
\w match any word character (alphanumerics plus "_")
\W match any nonword character
\x{abcd} match the character given in hexadecimal
\X match unicode "combining character sequence" string
\z true at end of string only
\Z true at end of string or before optional newline
}

Most of these are pretty self-explanatory, so I won't go into a whole lot of detail describing each one. Let's just review a few of the properties of these metasymbols. Firstly, only metasymbols that match characters may be used in character classes. I'm not even going to give any examples of violations of this rule; if you're not looking for something in particular, don't put it in a character class.
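A few of the more common metasymbols in action, as a sketch. The sample line is invented, and the /g modifier in list context is used to collect every match.

```perl
use strict;
use warnings;

my $line = "Order 66 shipped to Bob_Smith on 2009-04-29";   # made-up sample

my @numbers = $line =~ /\d+/g;   # every run of digits
my @words   = $line =~ /\w+/g;   # every run of word characters

my $whole_word = $line =~ /\bship\b/;   # false: "ship" is only part of "shipped"
my $word_start = $line =~ /\bship/;     # true: "shipped" begins at a word boundary

# Metasymbols that match a character may appear inside a character class.
my $in_class = $line =~ /[\d\s]+/;

print "numbers: @numbers\n";
print "word count: ", scalar(@words), "\n";
```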

Ok, let's move on to the s/// operator. It's nice to be able to match a pattern easily and thoroughly, but don't you sometimes want to make changes, revisions, etc. to a given string? Sure you would! For that particular job we have the s/// operator. The s/// operator shares many of the same properties as the m// operator. They both interpolate variables within the delimiters as would any string within double quotes, one may group using parentheses, and one can use character classes, quantifiers, modifiers, and metasymbols within the s/// operator much like m//. There is some added functionality, however, and a few dissimilarities between m// and s/// that warrant mention.

Firstly, let's discuss grouping within the s/// operator. We are allowed three really cool variables with each regex, $`, $&, and $', which hold the text before the match, the matched text itself, and the text after the match, respectively. Fig. 10 contains some examples of these variables in use.

Fig. 10
s/bar/$`/ ## assuming $_ was "foobar", $_ would become "foofoo"
s/be/$'/ ## assuming $_ was "doobedoo", $_ would become "doodoodoo"
s#http://www\.osix\.net/## ## would replace any http url from that host with a relative path; the replacement is left empty, since using $' here would append a second copy of the path
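The three variables can be inspected after an ordinary match. A minimal sketch, using the strings from Fig. 10; the variable names are assumptions.

```perl
use strict;
use warnings;

my ($before, $match, $after);
if ("doobedoo" =~ /be/) {
    # Copy the special variables right away; the next successful match overwrites them.
    ($before, $match, $after) = ($`, $&, $');
}
print "before=$before match=$match after=$after\n";

my $string = "foobar";
$string =~ s/bar/$`/;   # "bar" is replaced by the text that preceded it
print "$string\n";
```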

In addition to these built-in variables, we can use any number of sequenced, built-in variables that refer to groups. They are used simply by stating, $1, $2, $3, and so on and so forth. In this way we can refer to groups within a regex relative to their order. This makes order modifications really simple in the way of Fig. 11:

Fig. 11
s/(foo)(bar)/$2 $1d/ ##makes "foobar" "bar food"

When using nested grouping, each group will be numbered by the opening parenthesis. So one would group the expression below as such...

s/((\w)(\w))// ## $1 will hold two word characters, $2 will hold the first word character, and $3 will hold the second

While the return value of the m// operator is boolean, the s/// operator returns the number of successful substitutions (or a false value if none were made). Either way, the substitution modifies the bound string in place. If you want to keep the original string and work on a modified copy, assign first inside parentheses and bind the result; the assignment happens first, and the substitution is then applied to the new variable. For instance...
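Group numbering by opening parenthesis can be verified with a two-character string. In this sketch the strings are assumptions, and ${1} is written with braces only to make it obvious where the variable name ends.

```perl
use strict;
use warnings;

my ($outer, $first, $second);
if ("ab" =~ /((\w)(\w))/) {
    # Groups are numbered by the position of their opening parenthesis.
    ($outer, $first, $second) = ($1, $2, $3);
}
print "outer=$outer first=$first second=$second\n";

my $swapped = "foobar";
$swapped =~ s/(foo)(bar)/$2 ${1}d/;   # reorder the groups and append a "d"
print "$swapped\n";
```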

($newstring = $oldstring)=~s/pattern/substitution/;
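A sketch of the copy-then-substitute idiom next to the plain return value of s///. The sample strings are made up, and the /g modifier used in the second half is described in the following paragraph.

```perl
use strict;
use warnings;

my $oldstring = "foo bar foo";

# The assignment runs first, then s/// is applied to $newstring only.
(my $newstring = $oldstring) =~ s/foo/baz/;

# The return value of s/// is the number of substitutions made.
my $text     = "foo bar foo";
my $replaced = ($text =~ s/foo/baz/g);   # both occurrences
my $nothing  = ($text =~ s/qux/baz/);    # no match, so a false value

print "old=$oldstring new=$newstring\n";
print "replaced=$replaced text=$text\n";
```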

There are two additional modifiers that we should discuss for the s/// operator. The first, /g, makes substitutions for all occurrences of the match (m// accepts /g as well). One should become really familiar with this modifier in a short while. The second modifier, /e, is one of the coolest so far; it tells the engine to treat the substitution field as perl code. Fig. 12 and Fig. 13 are both examples of common usage of the /e modifier, but watch what happens when we aren't careful to use the /g modifier.

Fig. 12
while(<INPUT>){
        s/pattern/print OUTPUT "$&\n";/e;}

Fig. 13
while(<INPUT>){
        s/pattern/print OUTPUT "$&\n";/eg;}

In Fig. 12, if there were multiple instances of "pattern" within $_, the engine would ignore all but the first and move on; only the first instance would trigger the substitution (in this case a print expression, thanks to /e). Fig. 13 triggers it for every instance. Note that the matched text is replaced by the return value of the code, here the return value of print. As we mentioned, the return value of the s/// operator is the number of successful substitutions made, so without the /g modifier it can only be 1 or a false value (the empty string).

The quantifiers are what everyone likes to refer to as greedy in nature, more specifically *, +, and {} when used with a minimum value alone. When used alone, they attempt to eat up as much of the string as possible while still letting the pattern match. Often you will find yourself wondering why the regex didn't match some part of the string that you were sure it was supposed to find. You might be surprised to find that it is because you used the maximal quantifier, which ate up the bit you wanted to match. Make sure to follow the quantifier with ? if you intend to match minimally.

print "$1\n" if "temper tantrum" =~ /t(.*)t/; ## would print "emper tan"
print "$1\n" if "temper tantrum" =~ /t(.*?)t/; ## would print "emper "
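The interplay of /e and /g is easier to see when the code in the substitution field computes a value instead of printing. A minimal sketch with made-up data:

```perl
use strict;
use warnings;

my $once = "2 apples and 3 pears";   # hypothetical sample string
my $all  = $once;

# /e evaluates the replacement as perl code: each number is doubled.
my $count_once = ($once =~ s/(\d+)/$1 * 2/e);    # first number only
my $count_all  = ($all  =~ s/(\d+)/$1 * 2/eg);   # every number

print "$once ($count_once substitution)\n";
print "$all ($count_all substitutions)\n";
```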

Well, that about covers the basic syntax and functionality of perl's regex engine. If you have any questions or wish to delve deeper into the workings of the engine, refer to CPAN or email me.