[Tutor] Question regular expressions - the non-greedy pattern

Steven D'Aprano steve at pearwood.info
Mon Jan 21 17:32:09 CET 2013


On 22/01/13 01:45, Marcin Mleczko wrote:

> Now I'm changing the input string to (adding an extra '<'):
>
> s = '<<html><head><title>Title</title>'
>
> and evoking the last command again:
>
> print re.match('<.*?>', s).group()
> I would expect to get the same result
>
> <html>
>
> as I'm using the non-greedy pattern. What I get is
>
> <<html>
>
> Did I get the concept of non-greedy wrong or is this really a bug?


Definitely not a bug.


Your regex says:

"Match from the beginning of the string: less-than sign, then everything
up to the FIRST (non-greedy) greater-than sign."

So it matches the "<" at the beginning of the string, followed by the
"<html", followed by ">".

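The point is that non-greedy only controls where the match ENDS, not
where it starts. A small sketch (written for Python 3, so print is a
function; the names lazy and greedy are just for illustration) that
compares the two forms on your string:

```python
import re

s = '<<html><head><title>Title</title>'

# Both patterns are anchored at position 0 by re.match.
lazy = re.match('<.*?>', s)    # stops at the first '>' it can reach
greedy = re.match('<.*>', s)   # runs on to the last '>' in the string

print(lazy.span(), lazy.group())      # (0, 7) <<html>
print(greedy.span(), greedy.group())  # (0, 33) <<html><head><title>Title</title>
```

Both matches start at index 0; only the end point differs.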

To get the result you are after, you could do this:

import re

s = '<<html><head><title>Title</title>'

# Match two < signs, but only report from the second on
re.match('<(<.*?>)', s).group(1)    # returns '<html>'


# Skip the first character
re.match('<.*?>', s[1:]).group()    # returns '<html>'


# Don't match on < inside the <> tags
re.search('<[^<]*?>', s).group()    # returns '<html>'


Notice that the last example must use re.search, not re.match,
because it does not match the beginning of the string.
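
To see the difference for yourself, here is a short sketch (Python 3
syntax; the variable names are only illustrative) that runs the same
pattern through both functions:

```python
import re

s = '<<html><head><title>Title</title>'
pattern = '<[^<]*?>'

# re.match only tries position 0, where '<' is followed by another '<',
# which [^<] refuses to match, so there is no match at all.
anchored = re.match(pattern, s)

# re.search tries each position in turn and succeeds at index 1.
scanned = re.search(pattern, s)

print(anchored)         # None
print(scanned.start())  # 1
print(scanned.group())  # <html>
```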



By the way, you cannot parse general-purpose HTML with regular
expressions. You really should learn how to use Python's HTML
parsers, rather than trying to jury-rig something that will do a
dodgy job.
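
As a rough starting point, here is a minimal sketch using the standard
library parser. It assumes Python 3, where the module is html.parser
(in Python 2 it is called HTMLParser). The TagCollector class is a
made-up helper for illustration, and the input is a well-formed variant
of your string, not the one with the stray '<':

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags and text as they are seen."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, e.g. 'html', 'head', 'title'.
        self.start_tags.append(tag)

    def handle_data(self, data):
        # Called for the text found between tags.
        self.text.append(data)


collector = TagCollector()
collector.feed('<html><head><title>Title</title></head></html>')
collector.close()

print(collector.start_tags)     # ['html', 'head', 'title']
print(''.join(collector.text))  # Title
```

The parser keeps track of where tags begin and end, so you do not have
to guess with patterns.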




-- 
Steven

