[Tutor] Question regular expressions - the non-greedy pattern

Mon Jan 21 16:36:44 CET 2013

On Mon, Jan 21, 2013 at 3:45 PM, Marcin Mleczko <Marcin.Mleczko at onet.eu> wrote:

> Now I'm changing the input string to (adding an extra '<'):
>
> s = '<<html><head><title>Title</title>'
>
> and invoking the last command again:
>
> print re.match('<.*?>', s).group()
> I would expect to get the same result
>
> <html>
>
> as I'm using the non-greedy pattern. What I get is
>
> <<html>
>
> Did I get the concept of non-greedy wrong or is this really a bug?
>

No, this is not a bug. Note first that you are using re.match, which only
tries to match from the beginning of the string. If you want to match
anywhere inside the string, you should use re.search, which returns the
first match found. However, even re.search will still return '<<html>', since
that *is* a valid match of the regular expression '<.*?>', and re.search
returns the first match it finds.
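
As a quick check of the above, here is a small sketch comparing the two
calls. It is written for Python 3 (print as a function); the variable names
and the 'junk ' prefix are just illustrative assumptions.

```python
import re

s = '<<html><head><title>Title</title>'

# re.match only tries to match at the very start of the string.
match_result = re.match('<.*?>', s)
print(match_result.group())    # <<html>

# re.search scans forward, but position 0 already matches, so the
# result is the same.
search_result = re.search('<.*?>', s)
print(search_result.group())   # <<html>

# With some leading text, re.match fails while re.search still succeeds.
t = 'junk ' + s
prefixed_match = re.match('<.*?>', t)
prefixed_search = re.search('<.*?>', t)
print(prefixed_match)            # None
print(prefixed_search.group())   # <<html>
```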

In essence, re.search first tries match(regex, s), then
match(regex, s[1:]), then match(regex, s[2:]), and so on, moving
forward one character at a time until the regular expression produces a match.
Since the regex produces a match at the first character, matching at the
second isn't even tried.
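
The scanning described above can be sketched roughly like this. Note that
naive_search is a hypothetical helper written only for illustration; the
real re.search is implemented inside the regex engine and does not slice
the string, so treat this as a mental model rather than the actual code.

```python
import re

def naive_search(pattern, s):
    """Hypothetical model of re.search: try re.match at each offset."""
    # len(s) + 1 so that a pattern matching the empty string can still
    # succeed at the very end of the input.
    for start in range(len(s) + 1):
        m = re.match(pattern, s[start:])
        if m is not None:
            # The first offset that matches wins; later offsets are
            # never tried.
            return start, m.group()
    return None

s = '<<html><head><title>Title</title>'
print(naive_search('<.*?>', s))       # (0, '<<html>')
print(naive_search('<.*?>', s[1:]))   # (0, '<html>')
```

The second call only returns '<html>' because the extra '<' was cut off
by hand before searching.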

It is true that non-greedy matching will try to match the fewest
characters possible. However, that only applies from a given starting
position: the engine will not abandon a starting position where the
pattern can match and try later ones to see if that produces a shorter
match. If a greedy variant of a regex matches, then the non-greedy variant
*will* also match at the same starting position. The only difference is the
length of the result.
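
To illustrate, the sketch below (Python 3) compares the greedy and
non-greedy variants on the same string. The last pattern, '<[^<>]*>', is
not from the original question; it is just one possible way to get
'<html>' here, assuming tags never contain '<' or '>' inside them.

```python
import re

s = '<<html><head><title>Title</title>'

greedy = re.search('<.*>', s)
lazy = re.search('<.*?>', s)

# Both variants start at index 0; only the length differs.
print(greedy.start(), greedy.group())  # 0 <<html><head><title>Title</title>
print(lazy.start(), lazy.group())      # 0 <<html>

# Excluding '<' and '>' from the tag body makes the match at index 0
# fail, so the search moves on and finds '<html>' at index 1.
strict = re.search('<[^<>]*>', s)
print(strict.start(), strict.group())  # 1 <html>
```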

More generally, regexes cannot fully parse HTML, since they simply lack the
power: HTML is not a regular language. If you want to parse arbitrary
HTML documents, or even sufficiently complex ones, you should use
a real HTML parser library (Python includes one, check the docs). If you
just want to grab some data from HTML tags, it's probably OK to use regexes,
if you're careful.
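
For reference, a minimal sketch using the parser from the standard library
might look like the following. It assumes Python 3, where the module is
html.parser (in Python 2 it is called HTMLParser). TitleGrabber is a
made-up class name, and the input is a well-formed variant of the string
from the question.

```python
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Hypothetical helper that collects the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names are passed in lowercase by the parser.
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # Keep only text that appears between <title> and </title>.
        if self.in_title:
            self.parts.append(data)

parser = TitleGrabber()
parser.feed('<html><head><title>Title</title></head></html>')
parser.close()

title = ''.join(parser.parts)
print(title)  # Title
```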

HTH,
Hugo