You've probably noticed that parentheses are used to isolate expressions in Perl, just as in mathematics. They do the same thing (and a little bit more) inside regular expressions. When a regex contains parentheses, Perl remembers which substrings matched the parts inside, and returns a list of them when evaluated in an array context (e.g., @array = /(\w+) (\w+) (\w+)/).
Parentheses
There are two reasons to enclose parts of a regular expression in parentheses:
- Parentheses group regex characters together (so special characters such as + and * can apply to the entire group).
- Parentheses create backreferences so you can determine which portion of the string matched the group.
Grouping
Let's say you want a program to recognize funny words. A funny word is a word starting with an m, b, or bl, followed by an a, e, or o repeated at least three times, and ending in an m, p, or b. Here are some funny words:
meeeeeep booooooooom blaaap
A pattern that matches funny words will have three parts.
| 1. m|b|bl | The prefix: m or b or bl |
| 2. a{3,}|e{3,}|o{3,} | The middle: 3 or more a's, e's, or o's |
| 3. m|p|b | The suffix: m or p or b |
You can't just concatenate these segments into
/m|b|bla{3,}|e{3,}|o{3,}m|p|b/
because the wrong chunks get ORed together. Instead, you should group the segments properly with parentheses.
/(m|b|bl)(a{3,}|e{3,}|o{3,})(m|p|b)/
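As a quick check of why the grouping matters, the following sketch runs the ungrouped and grouped patterns against a few strings. The two patterns are copied from the text; the extra test words map and bed, and the variable names, are assumptions chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Patterns copied from the text; qr// compiles them once for reuse.
my $ungrouped = qr/m|b|bla{3,}|e{3,}|o{3,}m|p|b/;
my $grouped   = qr/(m|b|bl)(a{3,}|e{3,}|o{3,})(m|p|b)/;

# The first three words come from the text; "map" and "bed" are
# ordinary words added here for contrast (an assumption of this sketch).
my @words = qw(meeeeeep booooooooom blaaap map bed);

my (%ungrouped_hit, %grouped_hit);
for my $word (@words) {
    $ungrouped_hit{$word} = ($word =~ $ungrouped) ? 1 : 0;
    $grouped_hit{$word}   = ($word =~ $grouped)   ? 1 : 0;
    printf "%-12s ungrouped=%d grouped=%d\n",
        $word, $ungrouped_hit{$word}, $grouped_hit{$word};
}
```

The ungrouped pattern is satisfied by any string that contains a lone m, b, or p, which is why map and bed slip through it but are rejected by the grouped version.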
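You can use parentheses, yet again, to apply a single {3,} to the whole group of a, e, and o. Note that this shorter pattern is slightly looser than the previous one: because the {3,} applies to the group rather than to each letter, it also accepts mixed vowels, as in baeop.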
/(m|b|bl)(a|e|o){3,}(m|p|b)/
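The two grouped patterns are not quite interchangeable, and a small comparison makes the difference visible. This is a minimal sketch: both patterns are taken from the text, while the mixed-vowel word baeop and the variable names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $per_vowel = qr/(m|b|bl)(a{3,}|e{3,}|o{3,})(m|p|b)/;  # same vowel repeated
my $any_vowel = qr/(m|b|bl)(a|e|o){3,}(m|p|b)/;          # any mix of a, e, o

# "blaaap" is from the text; "baeop" is a made-up word with mixed vowels.
my %result;
for my $word (qw(blaaap baeop)) {
    $result{$word} = [
        ($word =~ $per_vowel) ? 1 : 0,
        ($word =~ $any_vowel) ? 1 : 0,
    ];
    print "$word: per-vowel=$result{$word}[0] any-vowel=$result{$word}[1]\n";
}
```

If a funny word must repeat one and the same vowel, the longer pattern with a{3,}|e{3,}|o{3,} is the one to use.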
Backreferences
Parentheses do something else, too: Each pair of parentheses creates a temporary variable, called a backreference, containing whatever matched inside (Figure 2-6). In an array context, each match returns an array of all the backreferences.
Figure 2-6 Backreferences refer back to parenthetical matches
($word) = ($line =~ /(\S+)/);
places the first word of $line into $word if there is one. If not, the match returns an empty list and $word is left undefined, which Perl treats as FALSE.
($word1, $word2) = ($line =~ /(\S+)\s+(\S+)/);
places the first two words of $line into $word1 and $word2. If there aren't two words, the match fails and both $word1 and $word2 are left undefined.
$word = ($line =~ /(\S+)/);
is a scalar context, so the parentheses don't affect the return value. $word is set to 1 if $line contains a word and FALSE otherwise.
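The difference between the two contexts can be seen side by side in a short sketch. The sample strings and variable names here are assumptions chosen for illustration, and the two-word case uses the pattern with two pairs of parentheses, /(\S+)\s+(\S+)/, that twoword also uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line  = "hello there";   # sample input, chosen for this sketch
my $blank = "   ";           # contains no word at all

# List context: the match returns the captured substrings.
my ($word)          = ($line  =~ /(\S+)/);
my ($word1, $word2) = ($line  =~ /(\S+)\s+(\S+)/);
my ($missing)       = ($blank =~ /(\S+)/);   # failed match: empty list

# Scalar context: the match returns only success or failure.
my $found     = ($line  =~ /(\S+)/);
my $not_found = ($blank =~ /(\S+)/);

print "word:    $word\n";
print "words:   $word1, $word2\n";
print "missing: ", (defined $missing ? $missing : "undef"), "\n";
print "found:   $found\n";
print "not found is false\n" unless $not_found;
```

The only syntactic difference between the two forms is the pair of parentheses around the variables on the left-hand side, so it is worth double-checking them whenever a match unexpectedly yields 1 instead of a word.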
Backreferences are also stored in the temporary variables $1, $2, $3, and so forth. $1 contains whatever substring matched the first pair of parentheses, $2 contains whatever substring matched the second pair of parentheses, and so on, as high as you need.
The regular expression /(\S+)\s+(\S+)/ contains two pairs of parentheses, so two backreferences are created for any matching string. twoword, shown in Listing 2-10, shows how this can be handled.
Listing 2-10 twoword: Using backreferences
#!/usr/bin/perl -w
$_ = shift;
if (/(\S+)\s+(\S+)/) {                # two chunks of non-space
    print "The first word is $1\n";   # The substring matching the first (\S+).
    print "The second word is $2\n";  # The substring matching the second (\S+).
}
% twoword "hello there"
RESULT: The first word is hello
The second word is there
nameswap, shown in Listing 2-11, swaps the user's first and last names.