

CHAPTER 2
PATTERN MATCHING: REGULAR EXPRESSIONS

Many tasks require recognizing patterns in text: extracting a column of text from the output of a command, replacing certain strings in text with others, selecting elements from an array that fulfill some condition. All use patterns that match some pieces of data but not others. In Perl, patterns can be represented compactly and efficiently as regular expressions.

Perl’s regular expressions make it possible to write powerful but concise programs, often only one or two lines long, that would require dozens or even hundreds of lines in other programming languages. Regular expressions are the heart and soul of Perl, and the source of Perl’s reputation for being useful but cryptic.

In a sense, regular expressions constitute a little programming language of their own. When you construct a regular expression, you combine symbols (each of which is a simple pattern) into a more complicated structure that defines a more sophisticated pattern. Think of these symbols (and there are a lot of them in this chapter) as elemental building blocks; you can sculpt them into architectural marvels, as long as you arrange them very precisely.

Here’s a teaser, to show you the power of regular expressions. The command

perl -p -e 's/\b([^aeiouy]*)(\S+)\s?/$2$1ay /gi'

converts text into pig Latin.

echo "pattern match" | perl -p -e 's/\b([^aeiouy]*)(\S+)\s?/$2$1ay /gi'

prints

atternpay atchmay

What more could you ask for?
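If the one-liner looks like line noise, it may help to see the same substitution spread out. What follows is only a sketch: it uses the /x modifier so that whitespace and comments inside the pattern are ignored, and the sub name pig_latin is made up for this illustration.

```perl
use strict;
use warnings;

# pig_latin is a hypothetical helper name, chosen for this sketch only.
sub pig_latin {
    my ($text) = @_;
    $text =~ s{
        \b              # begin at a word boundary
        ([^aeiouy]*)    # group 1: leading characters that are not vowels (may be empty)
        (\S+)           # group 2: the remaining non-whitespace characters
        \s?             # at most one trailing whitespace character
    }{$2$1ay }gix;      # g: every word, i: ignore case, x: allow this layout
    return $text;
}

print pig_latin("pattern match"), "\n";
```

Run as a script, this prints atternpay atchmay followed by a trailing space, because the replacement appends a space after every word.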

Session 1
Regular Expressions

Regular expressions (often called regexes) are patterns of characters. When you construct a regular expression, you’re defining a pattern (or “filter”) that accepts certain strings and rejects all others (Figure 2-1).


Figure 2-1  Patterns are selective, like bouncers at a nightclub: They accept some strings and reject others
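To make the bouncer idea concrete, here is a small sketch using Perl’s =~ binding operator, which is true when the pattern matches somewhere in the string. The sub name admits and the list of sample strings are assumptions made for this example.

```perl
use strict;
use warnings;

# admits is a hypothetical helper: returns 1 if the string contains "fossil", else 0.
sub admits {
    my ($string) = @_;
    return $string =~ /fossil/ ? 1 : 0;
}

# Sample strings: the pattern lets some in and turns the others away.
for my $guest ('fossil', 'fossiliferous', 'ossify', 'fosssssil') {
    printf "%-14s %s\n", $guest, admits($guest) ? 'accepted' : 'rejected';
}
```

The first two strings are accepted because they contain the exact sequence f-o-s-s-i-l; the last two are rejected because they don’t.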

Table 2-1 shows a few regular expressions. Eyeball each regex on the left and see if you can infer the rule: Why does the pattern match the strings in the middle column, but not the strings in the right column? All the funny characters that define the pattern (*, +, \s, |, and so on) will be explained in greater detail throughout this chapter (plus there’s a summary in Appendix I), so think of this as a puzzle, nothing more.

Table 2-1 Some regular expressions

Pattern        Some Strings That Match the Pattern               Some Strings That Don’t

/fossil/       fossil and fossiliferous                          ossify and fosssssil
/bo*/          b and bo and booo                                 ooooo and splat
/a(ha)*/       a and aha and ahaha and ahahaha...                ho and hhhh
/hm+/          hm and hmm and hmmm...                            h and m
/yooho{2,4}/   yoohoo and yoohooo and yoohoooo                   yooho
/Joh?n/        Jon and John                                      Jan and Johhn
/black|white/  black and white                                   green
/high\s+low/   high low (with some whitespace in between)        highlow and highXXXlow
/high\S+low/   highXXXlow (with some non-whitespace in between)  highlow and high low
/\$\d\.\d\d/   $1.01 and $8.95 and $6.70...                      23 and $19.95
/.../          any three characters                              no and hi and we and us

As you can see, regular expressions are usually enclosed in slashes. The various special characters (., *, +, ?, {}) and metacharacters (\d, \s, \S) in Table 2-1 each change the regex’s behavior in their own way, affecting which strings match the regex and which strings don’t.
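One way to check your answers to the puzzle is to let Perl do the matching. The sketch below stores several rows of Table 2-1 as compiled patterns (built with qr//) alongside strings taken from the table, then counts any string that behaves differently from what the table claims. The names @rows and count_mismatches are assumptions made for this example.

```perl
use strict;
use warnings;

# Each row: [ compiled pattern, [strings that should match], [strings that should not] ]
my @rows = (
    [ qr/fossil/,      [ 'fossil', 'fossiliferous' ], [ 'ossify', 'fosssssil' ] ],
    [ qr/bo*/,         [ 'b', 'bo', 'booo' ],         [ 'ooooo', 'splat' ] ],
    [ qr/hm+/,         [ 'hm', 'hmm', 'hmmm' ],       [ 'h', 'm' ] ],
    [ qr/yooho{2,4}/,  [ 'yoohoo', 'yoohoooo' ],      [ 'yooho' ] ],
    [ qr/Joh?n/,       [ 'Jon', 'John' ],             [ 'Jan', 'Johhn' ] ],
    [ qr/high\s+low/,  [ 'high low' ],                [ 'highlow', 'highXXXlow' ] ],
    [ qr/\$\d\.\d\d/,  [ '$1.01', '$8.95' ],          [ '23', '$19.95' ] ],
);

# count_mismatches is a hypothetical helper: it returns how many strings
# disagree with the column they were placed in.
sub count_mismatches {
    my @table = @_;
    my $mismatches = 0;
    for my $row (@table) {
        my ($regex, $should_match, $should_not) = @$row;
        for my $string (@$should_match) {
            $mismatches++ unless $string =~ $regex;
        }
        for my $string (@$should_not) {
            $mismatches++ if $string =~ $regex;
        }
    }
    return $mismatches;
}

my $mismatches = count_mismatches(@rows);
print $mismatches ? "$mismatches mismatches\n" : "every row behaves as the table says\n";
```

Adding your own strings to either list in a row is a quick way to test a guess about what a pattern accepts.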

