my $data = "<tag>this is a line of code</tag> <explanation>this is where I wax poetic about my code</explanation> <tag>this is another example of code</tag>";
if ($data =~ /<tag>(.*)<\/tag>/s) {
    print "I found =>$1<=\n";
}
If you said "this is a line of code", you're thinking the same thing most people do. Unfortunately, that's not the way Perl thinks:
I found =>this is a line of code</tag> <explanation>this is where I wax poetic about my code</explanation> <tag>this is another example of code<=
The secret lies in the mysterious asterisk (match zero or more of the preceding). When the engine hits .*, it jumps ahead to the end of the string and tries to match the next character in the pattern, the < character. Nothing is left to match, so the engine backtracks a character and tries again. The last character in the string is >, so that fails too, and the engine gives back another character. This continues through g, a, t, and /, until it finally reaches the final < in $data.
Knowing that, you now understand the danger of greediness (and, hopefully, also why parsing HTML with a regex can be tricky). The solution is very simple:
if ($data =~ /<tag>(.*?)<\/tag>/s) { print "I found =>$1<=\n"; }
Using the ? after a normally-greedy quantifier (* or +) tells the engine not to grab the longest string, but the first string that matches the whole pattern.
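To see the difference side by side, here is a minimal sketch that runs both patterns against the same $data string from the example above. The variable names ($greedy, $lazy, @all) are just illustrative, and the /g list-context match at the end is an extra not used in the original post.

```perl
use strict;
use warnings;

my $data = "<tag>this is a line of code</tag> "
         . "<explanation>this is where I wax poetic about my code</explanation> "
         . "<tag>this is another example of code</tag>";

# Greedy: .* grabs as much as it can, so the capture runs up to the LAST </tag>.
my ($greedy) = $data =~ /<tag>(.*)<\/tag>/s;

# Non-greedy: .*? stops at the first </tag> that lets the whole pattern succeed.
my ($lazy) = $data =~ /<tag>(.*?)<\/tag>/s;

# With /g in list context, the non-greedy form collects every tagged chunk.
my @all = $data =~ /<tag>(.*?)<\/tag>/sg;

print "greedy =>$greedy<=\n";
print "lazy   =>$lazy<=\n";
print "chunk  =>$_<=\n" for @all;
```

The greedy capture still contains the whole explanation element, while the non-greedy one holds only the first line of code.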
my $line = "Name: Some Soldier, Rank: Leftenant, Serial: 426879824, Boots: black";
$line =~ /^\w*: \w*\s*\w*, \w*: \w*, \w+: (\d)*, \w*: \w*$/;
$line =~ /[Ss]erial: (\d{9})/;
$line =~ /<title>(.*?)</title>/;
$line =~ m!<title>(.*?)</title>!;
my $line = "a.b.cd*f."; $line =~ /([^.*]{2})/;
my $input = "foo bar baz"; $input =~ s/(\w+)/uc($1)/ge;
$input =~ s/([A-Za-z]+)/uc($1)/ge;
$input =~ tr/a-z/A-Z/;
#!/usr/local/bin/perl -w
use strict;
use Benchmark;

my $count = 500000;

## Method number one
sub One {
    my $data = 'for bar baz';
    $data = uc $data;
}

## Method number two
sub Two {
    my $data = 'for bar baz';
    $data =~ tr/a-z/A-Z/;
}

## Method number Three
sub Three {
    my $data = 'for bar baz';
    $data =~ s/([A-Za-z]+)/uc($1)/ge;
}

## We'll test each one, with simple labels
timethese (
    $count,
    {
        'Method One UC'  => '&One',
        'Method Two TR'  => '&Two',
        'Method Three s' => '&Three'
    }
);
exit;
Benchmark: timing 500000 iterations of Method One UC, Method Three s, Method Two TR...
Method One UC: 1 wallclock secs ( 1.42 usr + 0.00 sys = 1.42 CPU) @ 352112.68/s (n=500000)
Method Three s: 16 wallclock secs (17.03 usr + 0.00 sys = 17.03 CPU) @ 29359.95/s (n=500000)
Method Two TR: 1 wallclock secs ( 2.04 usr + 0.00 sys = 2.04 CPU) @ 245098.04/s (n=500000)
my $data = 'for-bar-baz';
$data =~ s/-/_/g;
print $data;

my $data = 'for-bar-baz';
$data =~ tr/-/_/;
print $data;
Benchmark: timing 500000 iterations of Method One TR, Method Two s...
Method One TR: 2 wallclock secs ( 1.87 usr + 0.00 sys = 1.87 CPU) @ 267379.68/s (n=500000)
Method Two s: 5 wallclock secs ( 4.84 usr + 0.00 sys = 4.84 CPU) @ 103305.79/s (n=500000)
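If you want to rerun the uppercase comparison yourself, a variation along these lines may be handy. It is only a sketch: it swaps timethese for Benchmark's cmpthese, passes code references instead of strings, and uses a negative count (run each sub for at least that many CPU seconds), none of which the benchmark above did. The sub names with_uc, with_tr and with_s are made up for this example, and your numbers will differ from the ones shown above.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Each method returns its result so we can confirm they agree before timing.
sub with_uc { my $data = 'for bar baz'; return uc $data; }
sub with_tr { my $data = 'for bar baz'; $data =~ tr/a-z/A-Z/; return $data; }
sub with_s  { my $data = 'for bar baz'; $data =~ s/([A-Za-z]+)/uc($1)/ge; return $data; }

my @results = (with_uc(), with_tr(), with_s());
print "all three agree: $results[0]\n"
    if $results[1] eq $results[0] && $results[2] eq $results[0];

# -1 means "run each sub for at least 1 CPU second", then print a comparison chart.
cmpthese(-1, {
    'uc'   => \&with_uc,
    'tr'   => \&with_tr,
    's//e' => \&with_s,
});
```

One caveat when reading the results: tr/a-z/A-Z/ only touches ASCII letters, while uc also handles other alphabetic characters, so the three are interchangeable only for plain ASCII input like this.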
The real problem here is the use of /e on the substitution, when this would work just as well and be much more efficient:
s/(\w+)/\U$1/g;
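A quick check that the two forms give the same result on the tutorial's "foo bar baz" input. The variable names here are illustrative, and the \u line is an extra (it uppercases only the next character) that the reply above did not mention.

```perl
use strict;
use warnings;

my $with_e = "foo bar baz";
my $with_U = "foo bar baz";

$with_e =~ s/(\w+)/uc($1)/ge;   # /e evaluates uc($1) as Perl code for every match
$with_U =~ s/(\w+)/\U$1/g;      # \U uppercases the interpolated text, no /e needed

# \u uppercases just the first character of what follows.
(my $title = "foo bar baz") =~ s/(\w+)/\u$1/g;

print "$with_e\n$with_U\n$title\n";
```

Both substitutions produce FOO BAR BAZ, and the \u version produces Foo Bar Baz.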
$line =~ /^\w*: \w*\s*\w*, \w*: \w*, \w+: (\d)*, \w*: \w*$";
This starts the regexp with `/` but ends it with `"`. You could use m"..." or /.../, as long as both ends match.
You're allowed to do that with regexes? I just learned you could use (), "", <>, etc., with qq...
I should read more on Perl.
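For anyone else who hadn't seen alternate delimiters before, here is a small sketch of the same match written four ways. The sample string is made up for illustration. Note that the leading m is required whenever the delimiter is anything other than a slash.

```perl
use strict;
use warnings;

# Hypothetical sample input, not from the original post.
my $line = "<title>Some Page</title>";

# Default slashes: the / in </title> has to be escaped.
my ($escaped) = $line =~ /<title>(.*?)<\/title>/;

# Other delimiters: no escaping needed for the slash.
my ($bang)   = $line =~ m!<title>(.*?)</title>!;
my ($braces) = $line =~ m{<title>(.*?)</title>};
my ($quotes) = $line =~ m"<title>(.*?)</title>";

# Substitution accepts paired delimiters too: s{pattern}{replacement}.
(my $stripped = $line) =~ s{</?title>}{}g;

print "$_\n" for $escaped, $bang, $braces, $quotes, $stripped;
```

All five lines print Some Page. Picking a delimiter that doesn't appear in the pattern is mostly about readability.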
I found the Greedy section to be quite confusing. First of all, I think you probably have a couple of slash-s'es in your HTML that are causing it to not print, which makes the last couple of paragraphs very difficult to read and understand.
Also, I'm still confused about why there is a match at all in the first example. Why doesn't the engine continue backwards past the whitespace and look for a <\/tag> string?
Finally, why does the last example (still in the Greedy section) work? If, when creating the example string, I carriage return after the <\/tag>, there shouldn't be a whitespace to match on, right?
Finally, finally, thanks for putting this together... it's really speeding my ramp along...
Because the engine returns the first match it finds at the leftmost possible starting position, and a greedy quantifier makes it try the longest candidate first. When it hits .*, it jumps all the way to the end of the string and then backtracks, trying to match the next necessary character. Because it's backtracking from the end, the first </tag> it finds is the one at the end of the string. That fits the pattern, so it has no reason to keep backtracking toward a shorter match.
If, when creating the example string, I carriage return after the <\/tag>, there shouldn't be a whitespace to match on, right?
The /s flag allows the '.' token to match newlines. Adding the minimal token '?' avoids the jump-to-end-then-backtrack behavior. It works like you'd expect, trying to match as few characters as possible.
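Here is a small sketch of both points, using a made-up string that has a newline inside the first tag and another right after its closing </tag>. The pattern never mentions whitespace, so nothing after </tag> needs to match.

```perl
use strict;
use warnings;

# Hypothetical input: newline inside the first tag, and one right after </tag>.
my $data = "<tag>first line\nsecond line</tag>\n<tag>another</tag>";

# Without /s, '.' won't cross the newline, so the first tag can't match
# and the engine moves on to the second one.
my ($no_s) = $data =~ /<tag>(.*?)<\/tag>/;

# With /s, '.' matches newlines, and .*? stops at the first </tag>.
my ($with_s) = $data =~ /<tag>(.*?)<\/tag>/s;

# Greedy with /s: runs all the way to the last </tag>.
my ($greedy) = $data =~ /<tag>(.*)<\/tag>/s;

print "no /s  =>$no_s<=\n";
print "with /s=>$with_s<=\n";
print "greedy =>$greedy<=\n";
```

Without /s the capture is "another"; with /s and the non-greedy quantifier it is the two-line contents of the first tag; the greedy version swallows everything up to the final </tag>.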
Does that clear it up? I've also touched up the formatting somewhat.