[Tutor] Question regular expressions - the non-greedy pattern

Mon Jan 21 17:32:09 CET 2013

On 22/01/13 01:45, Marcin Mleczko wrote:

> Now I'm changing the input string to (adding an extra '<'):
>
> s = '<<html><head><title>Title</title>'
>
> and invoking the last command again:
>
> print re.match('<.*?>', s).group()
>
> I would expect to get the same result
>
> <html>
>
> as I'm using the non-greedy pattern. What I get is
>
> <<html>
>
> Did I get the concept of non-greedy wrong or is this really a bug?

Definitely not a bug.

Your regex says:

"Match from the beginning of the string: less-than sign, then everything
up to the FIRST (non-greedy) greater-than sign."

So it matches the "<" at the beginning of the string, followed by the
"<html", followed by ">".
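
A small sketch of that behaviour, in case it helps to see the match
positions. It is written for Python 3 (an assumption on my part; the
print statement above is Python 2 syntax), and the variable names are
only for illustration. Non-greedy controls how far the match extends
to the right, not where it starts:

```python
import re

s = '<<html><head><title>Title</title>'

# re.match always starts at position 0, so the first "<" is consumed
# by the literal "<" in the pattern. The non-greedy ".*?" then grows
# one character at a time until the next character is ">".
m = re.match('<.*?>', s)
span = (m.start(), m.end())   # where the match begins and ends
non_greedy = m.group()        # '<<html>'

# For comparison, the greedy version runs on to the LAST ">" in s.
greedy = re.match('<.*>', s).group()

print(span, non_greedy)
print(greedy)
```

Both versions start at the same place; they differ only in where
they stop.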

To get the result you are after, you could do this:

# Match two < signs, but only report from the second on
re.match('<(<.*?>)', s).group(1)

# Skip the first character
re.match('<.*?>', s[1:]).group()

# Don't match on < inside the <> tags
re.search('<[^<]*?>', s).group()

Notice that the last example must use re.search, not re.match,
because the match it finds starts at the second character, not at
the beginning of the string.
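
To check the three alternatives side by side, something like the
following should do. It is only a sketch for Python 3, and the names
a, b, c and anchored are mine, chosen for illustration:

```python
import re

s = '<<html><head><title>Title</title>'

# Match two < signs, but only report from the second on
a = re.match('<(<.*?>)', s).group(1)

# Skip the first character
b = re.match('<.*?>', s[1:]).group()

# Don't match on < inside the <> tags
c = re.search('<[^<]*?>', s).group()

# The same pattern with re.match finds nothing: at position 0 the
# literal "<" matches, but the next character is another "<", which
# is neither allowed by [^<] nor equal to ">".
anchored = re.match('<[^<]*?>', s)

print(a, b, c, anchored)
```

All three of a, b and c come out as '<html>', while anchored is None.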

By the way, you cannot parse general-purpose HTML with regular
expressions. You really should learn how to use Python's HTML
parsers, rather than trying to jury-rig something that will do a
dodgy job.
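
As a rough starting point, here is a minimal sketch using the standard
library's HTMLParser. It assumes Python 3, where the class lives in
html.parser (in Python 2 the module is called HTMLParser). The
TagCollector class and its attributes are hypothetical names of my
own, and I feed it the string without the extra '<':

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag, plus any text inside <title>."""

    def __init__(self):
        super().__init__()
        self.tags = []          # start tag names, in document order
        self.title_parts = []   # chunks of text seen inside <title>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == 'title':
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


parser = TagCollector()
parser.feed('<html><head><title>Title</title>')
parser.close()

tags = parser.tags
title = ''.join(parser.title_parts)
print(tags, title)
```

The parser does the tokenising for you, so there is no pattern to
get subtly wrong; you only decide what to do with each tag and each
piece of text.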

-- 
Steven