

matches hmmm and hmmmm and hmmmmm and ... (It’s equivalent to /hmmmm*/ and /hmmm+/. Do you see why?)
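If you don't see why, one way to convince yourself is to run all three patterns against the same strings. The snippet below is only a sketch: it assumes the pattern under discussion is /hm{3,}/ (the pattern itself appears before this section), and the test words and variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up test words; any list would do.
my @words = ('hm', 'hmm', 'hmmm', 'hmmmm', 'ahmmmmm');

my @disagreements;
foreach my $word (@words) {
    my $braces = ($word =~ /hm{3,}/) ? 1 : 0;   # h, then three or more m's
    my $star   = ($word =~ /hmmmm*/) ? 1 : 0;   # hmmm, then zero or more m's
    my $plus   = ($word =~ /hmmm+/)  ? 1 : 0;   # hmm, then one or more m's
    push @disagreements, $word
        unless $braces == $star && $star == $plus;
    print "$word: $braces $star $plus\n";
}
print "All three patterns agree\n" unless @disagreements;
```

Each pattern demands an h followed by at least three m's; they just spell out the "at least" part differently.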

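You CAN’T set an upper bound without a lower bound: /hm{,3}/ won’t do what you want, because older versions of Perl treat {,3} as literal characters (only Perl 5.34 and later accept it as a quantifier). Write /hm{0,3}/ instead.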

Question Marks

? means “zero or one.” You can use the ? special character as a shorthand for {0,1}.

/Joh?n/

matches Jon and John: ‘Jo’ followed by zero or one h, followed by an n. (An additional question mark has a special meaning: Check out the section titled “Greediness” in Session 7.)
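Here's a small sketch you can run to check this. The list of names is made up, and the ^ and $ anchors are not part of the original pattern; they're added so that the whole string has to match rather than just a piece of it.

```perl
use strict;
use warnings;

# Made-up candidates: two that should match, two that shouldn't.
my @names = ('Jon', 'John', 'Johhn', 'Jn');

my @with_question = grep { /^Joh?n$/ }     @names;   # ? means zero or one h
my @with_braces   = grep { /^Joh{0,1}n$/ } @names;   # the longhand form

print "@with_question\n";   # prints "Jon John"
print "@with_braces\n";     # prints "Jon John"
```

Johhn is rejected because ? allows at most one h, and Jn is rejected because the o is still required.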

These are special characters, as opposed to metacharacters (Figure 2-4). Here’s how to tell the difference:


Figure 2-4  In regular expressions, metacharacters are “normal” characters with a special meaning when backslashed; special characters are “weird” characters with a special meaning when not backslashed.


Metacharacters vs. Special Characters
Metacharacters are alphanumeric characters that, when backslashed, have a special meaning. Example: \s.
Special characters are nonalphanumeric characters that have a special meaning unless they’re backslashed. Example: +.
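The distinction is easiest to see side by side. This is a minimal sketch: the matches helper and the sample strings are hypothetical, and qr// is used only as a convenient way to hand a pattern to a subroutine.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $string matches $pattern, 0 otherwise.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 1 : 0;
}

# Metacharacter: the letter s is ordinary, but \s means "whitespace".
print matches('yes',   qr/s/),  "\n";   # 1: contains the letter s
print matches('yes',   qr/\s/), "\n";   # 0: contains no whitespace
print matches('y e s', qr/\s/), "\n";   # 1: contains spaces

# Special character: + means "one or more", but \+ is a literal plus sign.
print matches('aab', qr/a+b/),  "\n";   # 1: one or more a's, then b
print matches('a+b', qr/a+b/),  "\n";   # 0: the a is followed by +, not b
print matches('a+b', qr/a\+b/), "\n";   # 1: a, a literal +, then b
```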

The . Special Character

Here’s another special character: the dot. Often, you’ll want to express “any character” in your regexes. You can do that with the “.” special character, which matches anything except for a newline.

/./ matches any character (except newline)
/.../ matches any three characters (except newlines)
/.*/ matches any number (including zero) of characters (except newlines).
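To watch the newline rule in action, try the three patterns on a string with a newline in the middle. This is just a sketch with a made-up sample string; the capturing parentheses are only there so the snippet can show what each pattern matched.

```perl
use strict;
use warnings;

my $text = "abcd\nef";   # made-up sample with a newline in the middle

my ($one)   = $text =~ /(.)/;     # "a": the first character
my ($three) = $text =~ /(...)/;   # "abc": the first three characters
my ($run)   = $text =~ /(.*)/;    # "abcd": .* stops at the newline

# d and e are separated only by a newline, which . refuses to match
my $spans_newline = ($text =~ /d.e/) ? 1 : 0;

print "$one $three $run $spans_newline\n";   # prints "a abc abcd 0"
```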

. always matches exactly one character: /the../ matches there and their and theta and theme, but not the or then. If you wanted to match those as well, you could do it with the special characters you just learned, such as

/the.?.?/

or

/the.{0,2}/
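The following sketch compares the three patterns on the words mentioned above. The ^ and $ anchors are an addition, not part of the original patterns; without them, each pattern would happily match any string that merely contains "the".

```perl
use strict;
use warnings;

my @words = qw(the then there their theta theme);

my @exactly_two = grep { /^the..$/ }     @words;   # the plus exactly two characters
my @optional    = grep { /^the.?.?$/ }   @words;   # the plus zero, one, or two
my @bounded     = grep { /^the.{0,2}$/ } @words;   # same thing, written with braces

print "@exactly_two\n";   # prints "there their theta theme"
print "@optional\n";      # prints "the then there their theta theme"
print "@bounded\n";       # prints "the then there their theta theme"
```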

Experiment!

Let’s say a contract has been e-mailed to you and you want to make four changes.

1.  Replace the word “whereas” with “since.”
2.  Replace signature lines with your name (which, coincidentally, happens to be “Your Name”).
3.  Replace all dates with “sometime.”
4.  Replace the phrase “one-half” with “1/2.”

Listing 2-6 shows a program that uses *, +, ?, and {} to get the job done.

Listing 2-6 destuff: Using *, +, ?, and {} in regular expressions

#!/usr/bin/perl -wn

s/_{3,}/Your Name/g;   # Replaces any series of >= 3 underscores
s/Whereas/Since/g;     # Replaces Whereas
s!one-half!1/2!g;      # Replaces one-half

                       # Replaces dates in May
s/May\s\S\S?,\s*\S+/sometime/g;

print;

Let’s run destuff on the text file contract, the contents of which are shown here:

     Whereas the party of the first part, ___________, and the party of
the second part, known as That Guy, have entered into a contract in good
faith as of May 4,1996 and:
     Whereas __________ was on May 5, 1996 rendered one-half of the
payment and will be rendered the remaining one-half as of May 30, 1997
upon request by That Guy.
     Whereas and hereunto this day of May 16,  1996 forthwith
undersigned: Signature: _____________

Here’s the result:

% destuff contract
     Since the party of the first part, Your Name, and the party of
the second part, known as That Guy, have entered into a contract in good
faith as of sometime and:
     Since Your Name was on sometime rendered 1/2 of the
payment and will be rendered the remaining 1/2 as of sometime
upon request by That Guy.
     Since and hereunto this day of sometime forthwith
undersigned: Signature: Your Name
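If you'd like to experiment without creating a contract file, the same four substitutions can be wrapped in a subroutine and applied to a string. This is a sketch, not a replacement for destuff: the name destuff_line and the sample sentence are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitutions from Listing 2-6.
sub destuff_line {
    my ($line) = @_;
    $line =~ s/_{3,}/Your Name/g;   # three or more underscores in a row
    $line =~ s/Whereas/Since/g;
    $line =~ s!one-half!1/2!g;
    # May, one whitespace character, a day of one or two non-whitespace
    # characters, a comma, optional whitespace, then the year
    $line =~ s/May\s\S\S?,\s*\S+/sometime/g;
    return $line;
}

my $sample = 'Whereas ____ paid one-half on May 4,1996 and May 16,  1996';
my $fixed  = destuff_line($sample);
print "$fixed\n";   # prints "Since Your Name paid 1/2 on sometime and sometime"
```

Notice that the date pattern is deliberately loose: \S accepts any non-whitespace character, not only digits, so a phrase like May xy, foo would be replaced too.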

This uses all the metacharacters and special characters that you’ve learned so far, as well as one new feature: the ability of s/// to use a delimiter other than a slash. Since destuff replaces one-half with 1/2, you might be tempted to write s/one-half/1/2/g. But because of the extra slash in 1/2, Perl would think “replace one-half with 1, using the modifier 2, and hey, what’s that extra /g doing there?” Luckily, you can use different delimiters to separate the chunks of s///.

s!one-half!1/2!g
s#one-half#1/2#g
s@one-half@1/2@g
s&one-half&1/2&g

Most nonalphanumeric characters are valid delimiters.
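A quick way to confirm that the delimiters are interchangeable is to run each version on a copy of the same string. The sample string and variable names below are made up; the last line shows one more option not discussed above, which is to keep the slash delimiter and backslash the slash inside the replacement.

```perl
use strict;
use warnings;

my $original = 'one-half now, one-half later';   # made-up sample text

# Each line copies $original and then substitutes on the copy.
(my $bang  = $original) =~ s!one-half!1/2!g;
(my $hash  = $original) =~ s#one-half#1/2#g;
(my $at    = $original) =~ s@one-half@1/2@g;
(my $amp   = $original) =~ s&one-half&1/2&g;
(my $slash = $original) =~ s/one-half/1\/2/g;    # works, but harder to read

print "$_\n" for $bang, $hash, $at, $amp, $slash;   # five identical lines
```

All five variables end up holding 1/2 now, 1/2 later.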

Quiz 2

1.  Which statement is false?
a.  * and + are special characters; \s and \S are metacharacters.
b.  The . special character matches all characters.
c.  \s{2,4} matches two, three, or four spaces.
d.  The /g modifier matches all occurrences of a pattern, instead of just the first.
2.  What’s printed by the following code?
$_ = 'E Pluribus Unum';
s/\s/-/;
s/Pluribus/of many/g;
s/Unum/1/;
s/E/Out/;
s/ /,\s/;
print;
a.  Out of many 1
b.  Out-of-many,-1
c.  Out-of,many 1
d.  Out-of,\smany 1
3.  Which string does not match /a+b*c?/ ?
a.  aaabbc
b.  a
c.  abc
d.  b
4.  Which string does not match /d{1,3}.o?.g+/ ? (Be careful!)
a.  dddg
b.  ddddoogg
c.  dog
d.  dogg

Exercise 2

Difficulty: Easy

Write a program that spoofs British English by making three substitutions. First, if a word ends in “or” and has more than four letters, the program should replace “or” with “our,” so that color becomes colour. Second, in all words ending in “zation” or “ze,” it should replace the z with an s, so that realize becomes realise. Third, coffee should become tea.

