So if you just want to match tabs and not spaces, you can say
print "Tabby" if /\t/;
Want to annoy everyone within earshot? Put on some earplugs, and type
perl -e 'print "\a" while 1'
Hexadecimal and Octal Characters Inside regexes
- \xNUM is a hexadecimal (base 16) number.
- \NUM is an octal (base 8) number. NUM must be two or three digits.
If you don't know what hexadecimal or octal numbers are, there's a short description in Chapter 4, Session 8. Hexadecimal numbers (that is, numbers in base 16 instead of our more familiar base 10) are represented with a \x prefix.
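\x1a
is 26,
\x{100}
is 256, and
\x{fffff}
is 1048575. Without braces, \x reads at most two hexadecimal digits, so larger values need the \x{...} form.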
A backslash followed by two or three digits is interpreted as an octal number (that is, a number in base 8): \00 is ASCII 0 (the null character), and \107 is ASCII 71: an uppercase G.
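As a quick check of these values, the following sketch prints the character codes behind a few escapes. The variable names are made up for illustration, and hex() and oct() are Perl's built-in conversion functions rather than anything specific to regexes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escapes in a double-quoted string (or a regex) stand for single characters.
my $ctrl_z = "\x1a";      # hexadecimal 1a
my $g      = "\107";      # octal 107
my $wide   = "\x{100}";   # brace form, for values above \xff

printf "%d %d %d\n", ord($ctrl_z), ord($g), ord($wide);
print "octal 107 is $g\n";

# hex() and oct() convert digit strings to numbers.
my $from_hex = hex("fffff");
my $from_oct = oct("177");
print "$from_hex $from_oct\n";

# An escape inside a pattern matches that one character.
my $has_g = "Gnu" =~ /\107/ ? "yes" : "no";
print "matched G: $has_g\n";
```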
You can remove any non-7-bit-ASCII characters from a string with
$string =~ tr/\000-\177//cd;
because decimal 127 is octal 177.
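Here is a small sketch of that tr/// call at work. The sample string is hypothetical; it contains one byte (\xe9) that lies above octal 177.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: \xe9 is outside the 7-bit ASCII range.
my $string = "caf\xe9 au lait\n";

# c complements the range \000-\177, d deletes whatever is matched,
# and tr/// returns the number of characters it matched.
my $removed = ($string =~ tr/\000-\177//cd);

print $string;
print "removed $removed character(s)\n";
```

The c and d modifiers work together here: c selects everything outside the listed range, and d deletes it because the replacement list is empty.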
Because a newline is 10 in ASCII, and decimal 10 is a in hexadecimal and 12 in octal, you can replace periods with newlines in three ways:
s/\./\n/g
or
s/\./\xa/g
or even
s/\./\12/g
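To see that the three substitutions are equivalent, this sketch applies each one to a copy of the same (hypothetical) string and compares the results.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "one.two.three";            # hypothetical sample

(my $named = $text) =~ s/\./\n/g;      # newline by name
(my $hexed = $text) =~ s/\./\xa/g;     # newline as hexadecimal a
(my $octal = $text) =~ s/\./\12/g;     # newline as octal 12

print "all three agree\n" if $named eq $hexed and $hexed eq $octal;
print $named, "\n";
```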
Perl defines five characters to let you tailor how regexes deal with capitalization.
\l, \u, \L, \U, \Q, and \E
- \l lowercases the next character.
- \u uppercases the next character.
- \L lowercases everything until the following \E.
- \U uppercases everything until the following \E.
- \Q quotes metacharacters until the following \E.
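These escapes behave the same way in double-quoted strings as they do in patterns, so one way to see their effect is simply to print the result. The sketch below uses a made-up sample word and illustrative variable names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $word = "mixedCase";                # hypothetical sample word

my $lower_first = "\l$word";           # only the first character changes
my $upper_first = "\u$word";
my $all_lower   = "\L$word\E!";        # \E ends the effect before the !
my $all_upper   = "\U$word\E!";
my $quoted      = "\Qa.b*c\E";         # metacharacters get backslashes

print "$lower_first\n$upper_first\n$all_lower\n$all_upper\n$quoted\n";
```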
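Suppose you want to match The Perl Journal and ensure that the T and P and J are capitalized. Here's how you could do it:
/\uthe \uperl \ujournal/
which is the same as
/The Perl Journal/
Both match The Perl Journal but not the perl journal or The perl journal. Because \u is applied when the pattern is built, before any matching takes place, these patterns are case-sensitive throughout, so they don't match THE PERL JOURNAL either. Adding the /i modifier would make the whole pattern, capitals included, insensitive to case.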
\l and \u and \L and \U can be used to match particular capitalizations, as capital (Listing 2-19) demonstrates.
Listing 2-19 capital: Using \l, \u, \L, and \U
#!/usr/bin/perl -w
$name = "franklin";
while (<>) {
print "matched \\l \n" if /\l$name/;
print "matched \\u \n" if /\u$name/;
print "matched \\L...\\E \n" if /\L$name\E/;
print "matched \\U...\\E \n" if /\U$name\E/;
}
Lines typed by the user are matched against four variations of the pattern franklin.
% capital
RESULT:
franklin
matched \l
matched \L...\E
Franklin
matched \u
FRANKLIN
matched \U...\E
When might you want to use \Q? As you've seen, your programs can construct their own patterns:
$movie = '\w*'
or even construct them from user input, as tickets6 did.
chomp($_ = <>); if ($movie =~ /$_/) { ... }
Patterns constructed from variables in this way might inadvertently contain special characters: Suppose the user types
The Color of $
or
The * Chamber
The $pattern in woolf (Listing 2-20) contains a question mark, which is interpreted as a special character.
Listing 2-20 woolf: A question mark interpreted as a special character
#!/usr/bin/perl
$pattern = "Who's Afraid of Virginia Woolf?"; # ? will mean 0 or 1 f's
$movie = "Who's Afraid of Virginia Wool";
if ($movie =~ /$pattern/) { print "Whoops!"; }
else { print "Okay."; }
% woolf
RESULT: Whoops!
You can implicitly backslash metacharacters and special characters by using \Q and \E, as shown in Listing 2-21.
Listing 2-21 woolf2: using \Q to avoid interpreting regex characters
#!/usr/bin/perl -l
$pattern = "Who's Afraid of Virginia Woolf?";
$movie = "Who's Afraid of Virginia Wool";
if ($movie =~ /\Q$pattern\E/) { print "Whoops!"; }
else { print "Okay."; }
Because $pattern is between \Q and \E, the ? takes its literal meaning instead of its regex meaning.
% woolf2
RESULT: Okay.
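Perl's built-in quotemeta function does the same job as \Q...\E, which makes it easy to look at what the quoted pattern contains. The sketch below reuses the strings from the woolf listings and adds the The * Chamber title mentioned earlier; the result variables are illustrative names only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $pattern = "Who's Afraid of Virginia Woolf?";
my $movie   = "Who's Afraid of Virginia Wool";

# quotemeta backslashes every character that isn't a letter, digit,
# or underscore, so the spaces and the apostrophe are quoted too.
my $quoted = quotemeta($pattern);
print "$quoted\n";

my $unquoted_hit = $movie =~ /$pattern/ ? 1 : 0;   # ? makes the f optional
my $quoted_hit   = $movie =~ /$quoted/  ? 1 : 0;   # ? is now a literal

# Unquoted, a title containing * doesn't even match itself.
my $typed       = "The * Chamber";
my $self_plain  = $typed =~ /$typed/      ? 1 : 0;
my $self_quoted = $typed =~ /\Q$typed\E/  ? 1 : 0;

print "unquoted: $unquoted_hit, quoted: $quoted_hit\n";
print "self match plain: $self_plain, with \\Q: $self_quoted\n";
```

Whether to write \Q in the pattern or call quotemeta beforehand is a matter of taste; \Q keeps the quoting visible at the point of the match.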
You've seen that tr/// lets you specify a set of characters. You can use sets with s/// and m// as well, using square brackets ( [ ] ).
Character Sets
Inside regular expressions, character sets can be enclosed in square brackets.
s/[A-Z]\w*//g;
removes words beginning with a capital letter.
Inside brackets, characters are implicitly ORed:
s/[aeiou]//g;
removes all vowels.
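A short sketch of both substitutions, run on copies of a hypothetical sentence so the results can be compared side by side:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Alice met Bob in paris";          # hypothetical sample

(my $no_caps   = $text) =~ s/[A-Z]\w*//g;     # drops Alice and Bob
(my $no_vowels = $text) =~ s/[aeiou]//g;      # lowercase vowels only

print "[$no_caps]\n";
print "[$no_vowels]\n";
```

Note that the A in Alice survives the second substitution, because the set lists only lowercase vowels.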
Special characters lose their special meanings inside brackets. (Metacharacters don't.) The hyphen adopts a new meaning, as you've seen, and so does ^, which, when placed at the beginning of the set, complements the rest of the set, so that
s/[^b]//g;
removes anything that's not a b.
/[A-Z][^A-Z]/;
matches any string that contains a capital letter followed by a noncapital.
s/[^aeiou\s.]//g;
removes everything but vowels, whitespace, and periods.
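The complemented sets above can be tried out on a small sample. The string and variable names in this sketch are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "A bat. Be bold.";                     # hypothetical sample

(my $only_b = $text) =~ s/[^b]//g;                # keeps lowercase b only
(my $sparse = $text) =~ s/[^aeiou\s.]//g;         # vowels, whitespace, periods

# A capital followed by a noncapital: the A and the space after it.
my $mixed   = $text =~ /[A-Z][^A-Z]/ ? 1 : 0;
# All capitals: no capital is followed by a noncapital character.
my $shouted = "ABC" =~ /[A-Z][^A-Z]/ ? 1 : 0;

print "[$only_b]\n[$sparse]\n";
print "mixed: $mixed, shouted: $shouted\n";
```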