[Tutor] Question regular expressions - the non-greedy pattern
Steven D'Aprano
steve at pearwood.info
Mon Jan 21 17:32:09 CET 2013
On 22/01/13 01:45, Marcin Mleczko wrote:
> Now I'm changing the input string to (adding an extra '<'):
>
> s = '<<html><head><title>Title</title>'
>
> and invoking the last command again:
>
> print re.match('<.*?>', s).group()
> I would expect to get the same result
>
> <html>
>
> as I'm using the non-greedy pattern. What I get is
>
> <<html>
>
> Did I get the concept of non-greedy wrong or is this really a bug?
Definitely not a bug.
Your regex says:
"Match from the beginning of the string: less-than sign, then everything
up to the FIRST (non-greedy) greater-than sign."
So it matches the "<" at the beginning of the string, followed by
"<html", followed by ">". Non-greedy only controls where the match
ends, not where it starts.
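You can check this yourself at the interactive prompt. This is only a small sketch, written for Python 3 (so print is a function, unlike in your example); the variable names m and greedy are mine:

```python
import re

s = '<<html><head><title>Title</title>'

# re.match only tries position 0, so the match has to begin at the first "<".
# The non-greedy ".*?" then stops at the earliest ">" that completes the match.
m = re.match('<.*?>', s)
print(m.span())     # (0, 7)
print(m.group())    # <<html>

# For comparison: the greedy form starts at the same place,
# but runs on to the last ">" in the string.
greedy = re.match('<.*>', s)
print(greedy.group() == s)    # True
```

Both versions start at index 0. The only thing the "?" changes is how far to the right the match extends.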
To get the result you are after, you could do this:
# Match two < signs, but only report from the second on
re.match('<(<.*?>)', s).group(1)
# Skip the first character
re.match('<.*?>', s[1:]).group()
# Don't match on < inside the <> tags
re.search('<[^<]*?>', s).group()
Notice that the last example must use re.search, not re.match,
because the pattern cannot match at the beginning of the string:
re.match only tries position 0, where the "<" is followed by
another "<".
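Here are the three approaches side by side, as a sketch you can paste into a Python 3 session. The variable names are just for illustration:

```python
import re

s = '<<html><head><title>Title</title>'

# 1. Match two "<" signs, but only report the group starting at the second.
first = re.match('<(<.*?>)', s).group(1)

# 2. Skip the first character before matching.
second = re.match('<.*?>', s[1:]).group()

# 3. Refuse to match "<" inside the tag; this needs re.search.
found = re.search('<[^<]*?>', s)
third = found.group()

print(first, second, third)    # <html> <html> <html>
print(found.start())           # 1

# With re.match the third pattern fails, because at position 0
# the "<" is followed by another "<" rather than by ">".
print(re.match('<[^<]*?>', s))    # None
```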
By the way, you cannot parse general-purpose HTML with regular
expressions. You really should learn how to use Python's HTML
parsers, rather than trying to jury-rig something that will do a
dodgy job.
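To give you an idea of what that looks like, here is a minimal sketch using the standard library parser. It assumes Python 3, where the module is html.parser (in Python 2 it is called HTMLParser). The TitleFinder class and the sample document are made up for this example, and I have used a well-formed document rather than your string with the extra "<":

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Hypothetical helper: records start tags and the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.in_title = False
        self.title = ''

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag the parser sees.
        self.tags.append(tag)
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # Text between tags arrives here; keep it only while inside <title>.
        if self.in_title:
            self.title += data


doc = '<html><head><title>Title</title></head></html>'
finder = TitleFinder()
finder.feed(doc)
finder.close()

print(finder.tags)     # ['html', 'head', 'title']
print(finder.title)    # Title
```

The parser does the job of finding where tags begin and end, so you only write code for what to do with each one.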
--
Steven