

Parentheses

You’ve probably noticed that parentheses are used to isolate expressions in Perl, just as in mathematics. They do the same thing (and a little bit more) inside regular expressions. When a regex contains parentheses, Perl remembers which substrings matched the parts inside, and returns a list of them when evaluated in an array context (e.g., @array = /(\w+) (\w+) (\w+)/).
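As a minimal sketch of the array-context behavior just described, the following assigns the result of a match with three pairs of parentheses to an array. The sample string and the variable name $line are illustrative assumptions, not part of the text.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical sample input; any three space-separated words would do.
my $line = "one two three";

# In array (list) context the match returns the substrings captured by
# each pair of parentheses, in left-to-right order.
my @array = ($line =~ /(\w+) (\w+) (\w+)/);

print "Captured: @array\n";   # prints "Captured: one two three"
```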


Parentheses
There are two reasons to enclose parts of a regular expression in parentheses:
  Parentheses group regex characters together (so special characters such as + and * can apply to the entire group).
  Parentheses create backreferences so you can determine which portion of the string matched the group.

Grouping

Let’s say you want a program to recognize “funny words.” A funny word is a word starting with an m, b, or bl, followed by an a, e, or o repeated at least three times, and ending in an m, p, or b. Here are some funny words:

meeeeeep
booooooooom
blaaap

A pattern that matches funny words will have three parts.

1. m|b|bl                The prefix: m or b or bl
2. a{3,}|e{3,}|o{3,}     The middle: 3 or more as, es, or os
3. m|p|b                 The suffix: m or p or b

You can’t just concatenate these segments into

/m|b|bla{3,}|e{3,}|o{3,}m|p|b/

because the wrong chunks get ORed together. Instead, you should group the segments properly with parentheses.

/(m|b|bl)(a{3,}|e{3,}|o{3,})(m|p|b)/
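To see what “the wrong chunks get ORed together” means in practice, here is a small sketch that runs both patterns over a few words. The words map and problem are hypothetical test inputs chosen for illustration; the other three are the funny words listed earlier.

```perl
#!/usr/bin/perl -w
use strict;

# Without parentheses, | splits the whole pattern into seven alternatives:
#   m   b   bla{3,}   e{3,}   o{3,}m   p   b
# so any string containing an m, b, or p matches.
my $ungrouped = qr/m|b|bla{3,}|e{3,}|o{3,}m|p|b/;

# With parentheses, each | only chooses among the items in its own group.
my $grouped   = qr/(m|b|bl)(a{3,}|e{3,}|o{3,})(m|p|b)/;

for my $word (qw(meeeeeep booooooooom blaaap map problem)) {
    my $u = $word =~ $ungrouped ? "yes" : "no";
    my $g = $word =~ $grouped   ? "yes" : "no";
    printf "%-12s ungrouped: %-3s grouped: %s\n", $word, $u, $g;
}
```

The ungrouped pattern accepts all five words, while the grouped pattern accepts only the three funny words. Neither pattern is anchored, so both can match anywhere inside a longer string.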

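You can use parentheses, yet again, to apply the {3,} to the whole group of a, e, and o. This shorter pattern is slightly more permissive than the previous one: the three or more vowels no longer have to be the same letter, so it also matches a word such as maeop.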

/(m|b|bl)(a|e|o){3,}(m|p|b)/
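The two grouped patterns are not quite interchangeable, which the following sketch tries to make visible. The words maeop and beeoom are hypothetical inputs invented for this comparison.

```perl
#!/usr/bin/perl -w
use strict;

# {3,} applied to each vowel separately: the same vowel must repeat.
my $same_vowel = qr/(m|b|bl)(a{3,}|e{3,}|o{3,})(m|p|b)/;

# {3,} applied to the group: any mix of a, e, and o, three or more long.
my $any_vowel  = qr/(m|b|bl)(a|e|o){3,}(m|p|b)/;

for my $word (qw(blaaap maeop beeoom)) {
    my $s = $word =~ $same_vowel ? "yes" : "no";
    my $a = $word =~ $any_vowel  ? "yes" : "no";
    printf "%-8s same vowel: %-3s any vowel: %s\n", $word, $s, $a;
}
```

Only blaaap matches both patterns; maeop and beeoom match the second pattern alone, because their vowels are mixed.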

Backreferences

Parentheses do something else, too: Each pair of parentheses creates a temporary variable, called a backreference, containing whatever matched inside (Figure 2-6). In an array context, each match returns an array of all the backreferences.


Figure 2-6  Backreferences “refer back” to parenthetical matches

($word) = ($line =~ /(\S+)/);

places the first word of $line into $word if there is one. If not, the match returns an empty list and $word is left undefined, which Perl treats as FALSE.

($word1, $word2) = ($line =~ /(\S+)\s+(\S+)/);

places the first two words of $line into $word1 and $word2. If there aren’t two words, neither $word1 nor $word2 is set.

$word = ($line =~ /(\S+)/);

is a scalar context, so the parentheses don’t affect the value returned. $word is set to 1 if $line contains a word and to a FALSE value (the empty string) otherwise.
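The three cases can be compared side by side in a short sketch. The strings in $line and $blank, and the names $matched and $none, are assumptions made for illustration.

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical inputs: one line with two words, one that is only whitespace.
my $line  = "hello there";
my $blank = "   ";

# Array (list) context, one pair of parentheses: the captured word is returned.
my ($word) = ($line =~ /(\S+)/);

# Array (list) context, two pairs of parentheses: both captures are returned.
my ($word1, $word2) = ($line =~ /(\S+)\s+(\S+)/);

# Scalar context: only success or failure is returned.
my $matched = ($line =~ /(\S+)/);

# A failed match in list context returns an empty list,
# so the variable on the left stays undefined.
my ($none) = ($blank =~ /(\S+)/);

print "word=$word word1=$word1 word2=$word2 matched=$matched\n";
print "none is ", (defined $none ? "defined" : "undefined"), "\n";
```

Checking defined (or simply testing the match itself in an if) before using the captured value avoids warnings under -w when the match fails.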

Backreferences are also stored in the temporary variables $1, $2, $3, and so forth. $1 contains whatever substring matched the first pair of parentheses, $2 contains whatever substring matched the second pair of parentheses, and so on, as high as you need.

The regular expression /(\S+)\s+(\S+)/ contains two pairs of parentheses, so two backreferences are created for any matching string. twoword, shown in Listing 2-10, shows how this can be handled.

Listing 2-10 twoword: Using backreferences

#!/usr/bin/perl -w

$_ = shift;

if (/(\S+)\s+(\S+)/) {                 # two chunks of non-space
    print "The first word is $1\n";    # The substring matching the first (\S+).
    print "The second word is $2\n";   # The substring matching the second (\S+).
}

% twoword "hello there"
RESULT: The first word is hello
The second word is there

nameswap, shown in Listing 2-11, swaps the user’s first and last names.

