Many tasks require recognizing patterns in text: extracting a column of text from the output of a command, replacing certain strings in text with others, selecting elements from an array that fulfill some condition. All of these tasks use patterns that match some pieces of data but not others. In Perl, patterns can be represented compactly and efficiently as regular expressions.
Perl's regular expressions make it possible to write powerful but concise programs (often as short as one or two lines) that would require dozens or even hundreds of lines in other programming languages. Regular expressions are the heart and soul of Perl, and the source of Perl's reputation for being useful but cryptic.
In a sense, regular expressions constitute a little programming language of their own. When you construct a regular expression, you combine symbols (each of which is a simple pattern) into a more complicated structure that defines a more sophisticated pattern. Think of these symbols (and there are a lot of them in this chapter) as elemental building blocks; you can sculpt them into architectural marvels, as long as you arrange them very precisely.
Here's a teaser, to show you the power of regular expressions. The command
perl -p -e 's/\b([^aeiouy]*)(\S+)\s?/$2$1ay /gi'
converts text into pig Latin.
echo "pattern match" | perl -p -e 's/\b([^aeiouy]*)(\S+)\s?/$2$1ay /gi'
prints
atternpay atchmay
What more could you ask for?
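If the one-liner looks like line noise, it may help to see the same substitution spread out with the /x modifier, which lets a pattern carry whitespace and comments. This is only a sketch: the sub name pig_latin is a hypothetical wrapper chosen for illustration, and the substitution inside it is the one from the teaser.

```perl
use strict;
use warnings;

# pig_latin is a hypothetical helper name; the regex is the teaser's,
# rewritten with /x so each piece can be labeled.
sub pig_latin {
    my ($text) = @_;
    $text =~ s/
        \b              # begin at a word boundary
        ([^aeiouy]*)    # group 1: leading run of non-vowels (may be empty)
        (\S+)           # group 2: the rest of the word, up to whitespace
        \s?             # swallow one whitespace character, if present
    /$2$1ay /gix;       # rest of word, then the leading run, then "ay "
    return $text;
}

print pig_latin("pattern match"), "\n";   # prints "atternpay atchmay " (note the trailing space)
```

The /g modifier repeats the substitution for every word, and /i makes the vowel test case-insensitive.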
Regular expressions (often called regexes) are patterns of characters. When you construct a regular expression, you're defining a pattern (or filter) that accepts certain strings and rejects all others (Figure 2-1).
Figure 2-1 Patterns are selective, like bouncers at a nightclub: They accept some strings and reject others
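To make the filter idea concrete, here is a minimal sketch that lets a pattern play bouncer for a list of strings, using Perl's built-in grep. The sample strings and variable names are illustrative assumptions, not taken from the text.

```perl
use strict;
use warnings;

# Illustrative strings; /match/ accepts any string containing "match".
my @strings = ('pattern match', 'matchbox', 'patter', 'Match');

my @accepted = grep {  /match/ } @strings;   # strings the pattern lets in
my @rejected = grep { !/match/ } @strings;   # strings it turns away

print "accepted: ", join(', ', @accepted), "\n";   # accepted: pattern match, matchbox
print "rejected: ", join(', ', @rejected), "\n";   # rejected: patter, Match
```

Note that Match is rejected: unless told otherwise, a pattern is case-sensitive.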
Table 2-1 shows a few regular expressions. Eyeball each regex on the left and see if you can infer the rule: Why does the pattern match the strings in the middle column, but not the strings in the right column? All the funny characters that define the pattern (*, +, \s, |, and so on) will be explained in greater detail throughout this chapter (plus there's a summary in Appendix I), so think of this as a puzzle, nothing more.
| Pattern | Some Strings That Match the Pattern | Some Strings That Don't |
|---|---|---|
| /fossil/ | fossil and fossiliferous | ossify and fosssssil |
| /bo*/ | b and bo and booo | ooooo and splat |
| /a(ha)*/ | a and aha and ahaha and ahahaha... | ho and hhhh |
| /hm+/ | hm and hmm and hmmm... | h and m |
| /yooho{2,4}/ | yoohoo and yoohooo and yoohoooo | yooho |
| /Joh?n/ | Jon and John | Jan and Johhn |
| /black\|white/ | black and white | green |
| /high\s+low/ | high low (with some whitespace in between) | highlow and highXXXlow |
| /high\S+low/ | highXXXlow (with some non-whitespace in between) | highlow and high low |
| /\$\d\.\d\d/ | $1.01 and $8.95 and $6.70... | 23 and $19.95 |
| /.../ | any three characters | no and hi and we and us |
As you can see, regular expressions are usually enclosed in slashes. The various special characters (., *, +, ?, {}) and metacharacters (\d, \s, \S) in Table 2-1 each change the regex's behavior in their own way, affecting which strings match the regex and which strings don't.
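If you'd rather check your answers to the puzzle by machine, the rows of Table 2-1 can be fed to Perl directly. The following is a sketch, not a definitive test harness: the data layout and variable names are assumptions made for illustration, and 'abc' stands in for "any three characters".

```perl
use strict;
use warnings;

# Each row: [ pattern, [strings that should match], [strings that should not] ]
my @rows = (
    [ qr/fossil/,      ['fossil', 'fossiliferous'],         ['ossify', 'fosssssil'] ],
    [ qr/bo*/,         ['b', 'bo', 'booo'],                 ['ooooo', 'splat'] ],
    [ qr/a(ha)*/,      ['a', 'aha', 'ahaha', 'ahahaha'],    ['ho', 'hhhh'] ],
    [ qr/hm+/,         ['hm', 'hmm', 'hmmm'],               ['h', 'm'] ],
    [ qr/yooho{2,4}/,  ['yoohoo', 'yoohooo', 'yoohoooo'],   ['yooho'] ],
    [ qr/Joh?n/,       ['Jon', 'John'],                     ['Jan', 'Johhn'] ],
    [ qr/black|white/, ['black', 'white'],                  ['green'] ],
    [ qr/high\s+low/,  ['high low'],                        ['highlow', 'highXXXlow'] ],
    [ qr/high\S+low/,  ['highXXXlow'],                      ['highlow', 'high low'] ],
    [ qr/\$\d\.\d\d/,  ['$1.01', '$8.95', '$6.70'],         ['23', '$19.95'] ],
    [ qr/.../,         ['abc'],                             ['no', 'hi', 'we', 'us'] ],
);

my $mismatches = 0;
for my $row (@rows) {
    my ($regex, $should_match, $should_not) = @$row;
    for my $string (@$should_match) {
        $mismatches++ unless $string =~ $regex;   # expected a match, got none
    }
    for my $string (@$should_not) {
        $mismatches++ if $string =~ $regex;       # expected no match, got one
    }
}

print "mismatches: $mismatches\n";   # prints "mismatches: 0"
```

The qr// operator used here compiles a pattern into a value that can be stored in a data structure and matched later with =~.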