Perl Monks 

Common Regex Gotchas

by chromatic on Apr 25, 2000 at 22:00

Perl novices often stumble over a few gotchas when first learning regular expressions. Learning the whys and the workarounds could save you hours of frustration.

Greediness

Perl's regex engine likes to match the longest string possible, by default. This is described as greediness. Most people don't think that way, at least when looking at text. Given the following string and regex, what will be in $1?
my $data = "<tag>this is a line of code</tag>
<explanation>this is where I wax poetic about my code</explanation>
<tag>this is another example of code</tag>";
if ($data =~ /<tag>(.*)<\/tag>/s) {
    print "I found =>$1<=\n";
}

If you said "this is a line of code", you're thinking the same thing most people do. Unfortunately, that's not the way Perl thinks:

I found =>this is a line of code</tag>
<explanation>this is where I wax poetic about my code</explanation>
<tag>this is another example of code<=

The secret lies in the mysterious asterisk (match zero or more of the preceding). When the engine hits it, it jumps ahead to the end of the string (the /s flag lets . match newlines) and tries to match the next character in the pattern -- the < character. Since the last character in the string is >, the match fails, and the engine backtracks a character. This continues through g, a, t, and /, until it finally reaches the final < in $data.

Knowing that, you now understand the danger of greediness (and, hopefully, also why parsing HTML with a regex can be tricky). The solution is very simple:

 if ($data =~ /<tag>(.*?)<\/tag>/s) {
    print "I found =>$1<=\n";
}

Using the ? after a normally-greedy quantifier (* or +) tells the engine not to grab the longest string, but to take as few characters as it can while still letting the whole pattern match.
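As a minimal sketch of the difference, the following runs the greedy and minimal patterns against the same data, then adds the /g flag (not used in the example above) to collect every <tag> body in one pass. The names $greedy, $minimal, and @all are only illustrative.

```perl
use strict;
use warnings;

# Same sample data as above, built with explicit newlines.
my $data = "<tag>this is a line of code</tag>\n"
         . "<explanation>this is where I wax poetic about my code</explanation>\n"
         . "<tag>this is another example of code</tag>";

# Greedy: .* runs to the end of the string, then backtracks to the LAST </tag>.
my ($greedy) = $data =~ /<tag>(.*)<\/tag>/s;

# Minimal: .*? grows one character at a time and stops at the FIRST </tag>.
my ($minimal) = $data =~ /<tag>(.*?)<\/tag>/s;

# In list context, /g returns the capture from every match.
my @all = $data =~ /<tag>(.*?)<\/tag>/sg;

print "greedy  =>$greedy<=\n";
print "minimal =>$minimal<=\n";
print "all     =>$_<=\n" for @all;
```

The minimal version picks up only the first line of code, and with /g it finds both <tag> bodies without swallowing the <explanation> in between.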

Specifying Too Much

This gotcha is more stylistic, but it can come back to haunt you later. Remember that regular expressions can be somewhat vague -- you don't have to specify the entire line, if you're only looking for a certain portion. Suppose that you want to find the word Serial, followed by a colon and then a nine-digit number. The data lines might look like this:
my $line = "Name: Some Soldier, Rank:  Leftenant, Serial: 426879824, Boots:  black";
A regex novice might bite off more than he could chew with the following:
$line =~ /^\w*: \w*\s*\w*, \w*: \w*, \w+: (\d)*, \w*: \w*$/;
If all you're interested in is the Serial number, only ask for that. It'll make your regex simpler, and it will handle deviations from what you think the line ought to look like. (That happens more often than you want to think.)
$line =~ /[Ss]erial: (\d{9})/;
Caveat: There are good reasons to break my Rule of Simplicity. Performance is one, and error handling is another. Be sure that the code works first, though, then try to make it tricky.
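To see how fragile the long pattern is, here is a small sketch that runs both regexes against the sample line exactly as written above, doubled spaces included. The names $strict_ok and $serial are hypothetical, chosen only for this illustration.

```perl
use strict;
use warnings;

# Sample line as written above, including the doubled spaces.
my $line = "Name: Some Soldier, Rank:  Leftenant, Serial: 426879824, Boots:  black";

# Over-specified: every field is spelled out, so one stray space sinks the match.
my $strict_ok = $line =~ /^\w*: \w*\s*\w*, \w*: \w*, \w+: (\d)*, \w*: \w*$/ ? 1 : 0;

# Focused: ask only for the label and the nine digits.
my ($serial) = $line =~ /[Ss]erial: (\d{9})/;

print "over-specified pattern matched: ", ($strict_ok ? "yes" : "no"), "\n";
print "serial: ", (defined $serial ? $serial : "not found"), "\n";
```

The extra space after "Rank:" is enough to make the long pattern fail, while the short one still pulls out 426879824.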

Special Characters

Don't forget that certain characters (like ., *, /, +, and ?) have special meanings within regular expressions. If you don't have a Unixy background (where escaping characters with a backslash is a little more common), you might write something like this, and stare at it in confusion for a while:
$line =~ /<title>(.*?)</title>/;
Hmm. The unescaped / in </title> ends the regex early. Check the perlre manual page for the skinny on exactly which characters have special meaning. Also be aware that choosing alternate delimiters can help out, as well as being more visually appealing:
$line =~ m!<title>(.*?)</title>!;
One other caveat is that, within a character class, these rules often don't apply:
my $line = "a.b.cd*f.";
$line =~ /([^.*]{2})/;
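A short sketch pulling these three points together: escaping the slash, switching delimiters, and using metacharacters literally inside a character class. The sample $html string and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

my $html = "<title>Common Regex Gotchas</title>";

# With / as the delimiter, the slash in </title> must be escaped.
my ($escaped) = $html =~ /<title>(.*?)<\/title>/;

# With ! as the delimiter, the slash is an ordinary character.
my ($alternate) = $html =~ m!<title>(.*?)</title>!;

# Inside [...], . and * are literal, so this finds the first two
# consecutive characters that are neither a dot nor an asterisk.
my $line = "a.b.cd*f.";
my ($pair) = $line =~ /([^.*]{2})/;

print "escaped   => $escaped\n";
print "alternate => $alternate\n";
print "pair      => $pair\n";
```

In "a.b.cd*f." the only place with two such characters in a row is "cd", so that is what ends up in $pair.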

Simple Substitutions

Want to make sure user input is completely uppercased? Here's one approach:
my $input = "foo bar baz";
$input =~ s/(\w+)/uc($1)/ge;
While that works, it's serious overkill. Even a less picky approach is sub-optimal:
$input =~ s/([A-Za-z]+)/uc($1)/ge;
Don't forget about the friendly tr/// operator -- it's made for simple substitutions like this. (Of course, if you're working with a locale different than simple English text, you're out of luck).
$input =~ tr/a-z/A-Z/;
Regular expressions give you a lot of power at the cost of some speed. Don't get out the chainsaw when a penknife will do.
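For plain ASCII input the approaches give the same result, which a quick sketch can confirm. It also includes the built-in uc function called directly, which the section above uses only inside s///; the variable names are illustrative.

```perl
use strict;
use warnings;

my $input = "foo bar baz";

# Three ways to uppercase plain ASCII text, each on its own copy.
(my $with_s  = $input) =~ s/(\w+)/uc($1)/ge;   # regex plus /e: overkill
(my $with_tr = $input) =~ tr/a-z/A-Z/;         # simple character mapping
my $with_uc  = uc $input;                      # built-in function

print "$_\n" for $with_s, $with_tr, $with_uc;
```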

Re: Common Regex Gotchas
by Desdinova on Mar 14, 2001 at 19:41
    Regarding the simple substitutions section: just to prove your point about not going overkill, I benchmarked the two ways you mentioned (tr and s) as well as plain uc, with this code:
    
    #!/usr/local/bin/perl -w
    use strict;
    use Benchmark;
    my $count =500000;
    ## Method number one
    sub One {
       my $data = 'for bar baz';
       $data = uc $data;
    
    }
    
    ## Method number two
    sub Two {
       my $data = 'for bar baz';
       $data =~ tr/a-z/A-Z/;
    }
    ## Method number Three
    sub Three {
       my $data = 'for bar baz';
       $data =~ s/([A-Za-z]+)/uc($1)/ge;
    }
    ## We'll test each one, with simple labels
    timethese (
      $count,
      {'Method One UC' => '&One',
       'Method Two TR' => '&Two',
       'Method Three s'=> '&Three'
       }
    );
    
    exit;
    
    And got these results:
    
    Benchmark: timing 500000 iterations of Method One UC, Method Three s, Method Two TR...
    Method One UC:  1 wallclock secs ( 1.42 usr +  0.00 sys =  1.42 CPU) @ 352112.68/s (n=500000)
    Method Three s: 16 wallclock secs (17.03 usr +  0.00 sys = 17.03 CPU) @ 29359.95/s (n=500000)
    Method Two TR:  1 wallclock secs ( 2.04 usr +  0.00 sys =  2.04 CPU) @ 245098.04/s (n=500000)
    
    I know this is not new information, but I figured I'd post here to highlight what you are saying.
    PS -- The benchmark method was stolen from Benchmarking your code

    UPDATE: Xxaxx pointed out to me in This Node that I am not making a fair comparison above. The eval of uc($1) in the s/// regex was eating up a lot of the cycles. The gap is smaller than 17:1 shown above...
    For a fairer test I compared a single-character substitution with tr/// and s///
    
     # Body of the s/// method
     my $data = 'for-bar-baz';
     $data =~ s/-/_/g;
     print $data;
     # Body of the tr/// method
     my $data = 'for-bar-baz';
     $data =~ tr/-/_/;
     print $data;
    
    Using the benchmarking above I got these results:
    
    Benchmark: timing 500000 iterations of Method One TR, Method Two s...
    Method One TR:  2 wallclock secs ( 1.87 usr +  0.00 sys =  1.87 CPU) @ 267379.68/s (n=500000)
    Method Two s:  5 wallclock secs ( 4.84 usr +  0.00 sys =  4.84 CPU) @ 103305.79/s (n=500000)
    
    Still, there is an advantage to tr/// over s///, which can be more noticeable depending on your data.

    Update 2: petral asked me a question in the CB about the way I call uc in method one, which made me realize that it won't actually do anything because I wasn't assigning the return value back to the variable. I updated the code to do that.
      If I am not mistaken, the Benchmark module is plagued by the "$& and friends". That means it makes the regexes slow by default. That means that the benchmarks you take are disproportionate and useless, since the inefficient single instance of $& ruins any optimizations perl can make on the substitution.
        Happily, that doesn't appear to be the case. I don't see any occurrence of the $& et al. variables in the code for Benchmark.pm

        The real problem here is the use of /e on the substitution, when this would work just as well and be much more efficient:

        s/(\w+)/\U$1/g;
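A minimal sketch comparing the two forms on the same input, assuming plain ASCII text; the variable names are hypothetical.

```perl
use strict;
use warnings;

my $text = "foo bar baz";

# /e evaluates uc($1) as Perl code for every match.
(my $with_e = $text) =~ s/(\w+)/uc($1)/ge;

# \U uppercases the interpolated $1 with no /e needed.
(my $with_U = $text) =~ s/(\w+)/\U$1/g;

print "with /e: $with_e\n";
print "with \\U: $with_U\n";
```

Both produce the same string; the second simply skips the per-match code evaluation.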
        
Re: Common Regex Gotchas
by Anonymous Monk on May 28, 2001 at 15:50
    under the section: ``Specifying too much'' you said:

    $line =~ /^\w*: \w*\s*\w*, \w*: \w*, \w+: (\d)*, \w*: \w*$";

    This ends the regexp with a `"' and starts it with `/`. you could do m"..." or /.../ .


      You're allowed to do that with regexes? I just learned you could use , (), "", <>, etc., with qq...

      I should read more on Perl.

Re: Common Regex Gotchas
by John M. Dlugosz on Jul 06, 2001 at 05:47
    I'd like to point out that tr/a-z/A-Z/ does something different from uc or \U. The former won't deal with "Peña", for example. The latter treats accented characters correctly according to the language in use, and potentially handles weird things like German ß and letters in addition to the normal 26 in English, such as þ.
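A small sketch of that difference, assuming Perl 5.12 or later so that the unicode_strings feature is available (without it, uc on such a string may depend on locale and on how the string is stored). The ñ is written as the escape \x{F1} to keep the source ASCII-only; the variable names are illustrative.

```perl
use strict;
use warnings;
use feature 'unicode_strings';   # assumption: Perl 5.12+; gives uc Unicode rules

# "Pe\x{F1}a" is "Peña" with the n-tilde written as an escape.
my $name = "Pe\x{F1}a";

(my $with_tr = $name) =~ tr/a-z/A-Z/;   # only touches the 26 ASCII letters
my $with_uc  = uc $name;                # also maps U+00F1 to U+00D1

# Compare against the expected results instead of printing raw characters.
print "tr left the n-tilde lowercase: ", ($with_tr eq "PE\x{F1}A" ? "yes" : "no"), "\n";
print "uc uppercased the n-tilde:     ", ($with_uc eq "PE\x{D1}A" ? "yes" : "no"), "\n";
```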
Re: Common Regex Gotchas
by Anonymous Monk on Nov 27, 2001 at 00:23
    I found the Greedy section to be quite confusing. First of all, I think you probably have a couple slash-s'es in your html that is causing it to not print and makes the last couple paragraphs very difficult to understand and read. Also, I'm still confused about why there is a match at all in the first example. Why doesn't the engine continue backwards past the whitespace and look for a </tag> string? Finally, why does the last example (still in the Greedy section) work? If, when creating the example string, I carriage return after the </tag>, there shouldn't be a whitespace to match on, right? Finally, finally, thanks for putting this together... it's really speeding my ramp along...
      Sorry - re-reading it, I realized that I accidentally inserted html tags that are not showing up.

      I found the Greedy section to be quite confusing. First of all, I think you probably have a couple slash-s'es in your html that is causing it to not print and makes the last couple paragraphs very difficult to understand and read.

      Also, I'm still confused about why there is a match at all in the first example. Why doesn't the engine continue backwards past the whitespace and look for a <\/tag> string?

      Finally, why does the last example (still in the Greedy section) work? If, when creating the example string, I carriage return after the <\/tag>, there shouldn't be a whitespace to match on, right?

      Finally, finally, thanks for putting this together... it's really speeding my ramp along...

        Why doesn't the engine continue backwards past the whitespace and look for a <\/tag> string?

        Because the engine prefers the longest match that starts at the leftmost possible position. When it hits .*, it jumps all the way to the end of the string and then backtracks, trying to match the next necessary character. Because it's backtracking, it matches </tag> at the end of the string. That fits the pattern, so it doesn't continue backtracking to find a shorter match.

        If, when creating the example string, I carriage return after the <\/tag>, there shouldn't be a whitespace to match on, right?

        The /s flag allows the '.' token to match newlines. Adding the minimal token '?' avoids the jump-to-end-then-backtrack behavior. It works like you'd expect, trying to match as few characters as possible.
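A minimal sketch of the /s effect, using a made-up two-line $data string where the newline falls inside the tag; the variable names are hypothetical.

```perl
use strict;
use warnings;

my $data = "<tag>line one\nline two</tag>";

# Without /s, . refuses to match "\n", so the pattern cannot span both lines.
my $without_s = $data =~ /<tag>(.*?)<\/tag>/  ? 1 : 0;

# With /s, . matches any character, newline included.
my $with_s    = $data =~ /<tag>(.*?)<\/tag>/s ? 1 : 0;

print "without /s: ", ($without_s ? "match" : "no match"), "\n";
print "with /s:    ", ($with_s    ? "match" : "no match"), "\n";
```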

        Does that clear it up? I've also touched up the formatting somewhat.

