regexp non-greedy matching bug?

John Hazen john at hazen.net
Sun Dec 4 03:14:48 EST 2005


[Mike Meyer]
> The thing to understand is that regular expressions are *search*
> functions, that return the first parsing that matches. They search a
> space of possible matches to each term in the expression. If some term
> fails to match, the preceding term goes on to its next match, and you
> try again. The "greedy" vs. "non-greedy" describes the order that the
> term in question tries matches. If it's greedy, it will try the
> longest possible match first. If it's non-greedy, it'll try the
> shortest possible match first.

That's a good explanation.  Thanks.
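
A minimal sketch of that search order, written for Python 3 (the sessions elsewhere in this post are Python 2). The sample strings are made up for illustration; the last pattern is the simplified one from my original question.

```python
import re

s = 'xfooyfooz'

# Greedy: .* tries the longest prefix first, so it only backs off
# as far as the LAST 'foo'.
greedy = re.match(r'(.*)foo', s)

# Non-greedy: .*? tries the shortest prefix first, so it stops at
# the FIRST 'foo'.
lazy = re.match(r'(.*?)foo', s)

print(greedy.group(1))  # xfooy
print(lazy.group(1))    # x

# The simplified pattern from the original question. The first (.*?)
# tries the empty string first; (foo)? then fails on 'bar' but is
# optional, so it matches nothing; the final (.*?) grows until $
# succeeds. That is the first parsing that matches, so the second
# 'foo' is never captured by group 3.
foofoo = re.compile(r'^(foo)(.*?)(foo)?(.*?)$')
print(foofoo.match('foobarfoo').groups())  # ('foo', '', None, 'barfoo')
```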

[John]
> > I want to match one or two instances of a pattern in a string.
> > >>> foofoo = re.compile(r'^(foo)(.*?)(foo)?(.*?)$')

[Mike]
> First, this pattern doesn't look for one or two instances of "foo" in
> a string. It looks for a string that starts with "foo" and maybe has a
> second "foo" in it as well.

Right.  In simplifying the expression for public consumption, one of the
terms I dropped was r'^.*?(foo)...'.

> To do what you said you want to do, you want to use the split method:
> 
> foo = re.compile('foo')
> if 2 <= len(foo.split(s)) <= 3:
>    print "We had one or two 'foo's"

Well, this would solve my dumbed down example, but each foo in the
original expression was a stand-in for a more complex term.  I was using
match groups to extract the parts of the match that I wanted.  Here's an
example (using Tim's correction) that actually demonstrates what I'm
doing:

>>> s = 'zzzfoo123barxxxfoo456baryyy'
>>> s2 = 'zzzfoo123barxxxfooyyy'
>>> foobar2 = re.compile(r'^.*?foo(\d+)bar(.*foo(\d+)bar)?.*$')
>>> print foobar2.match(s).group(1)
123
>>> print foobar2.match(s).group(3)
456
>>> print foobar2.match(s2).group(1)
123
>>> print foobar2.match(s2).group(3)
None
>>> 


Looking at re.split, it doesn't look like it returns the actual matching
text, so I don't think that fits my need.
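
That said, as far as I can tell from the re documentation, split does include the text of any capturing groups in its result, and findall returns the captured text directly. A rough Python 3 sketch, assuming both terms can be written as one shared pattern (which may not hold for my real expression); the name `term` is just for illustration:

```python
import re

s = 'zzzfoo123barxxxfoo456baryyy'
s2 = 'zzzfoo123barxxxfooyyy'

# One pattern standing in for each "foo...bar" term, with the part
# we want to extract in a capturing group.
term = re.compile(r'foo(\d+)bar')

# With a capturing group in the pattern, split() interleaves the
# captured text with the pieces between matches.
print(term.split(s))     # ['zzz', '123', 'xxx', '456', 'yyy']

# findall() returns just the captured groups, one per match.
print(term.findall(s))   # ['123', '456']
print(term.findall(s2))  # ['123']
```

Unlike the single anchored pattern, this does not cap the number of matches at two, so the caller would have to check the length of the result.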

> As the founder of SPARE...

Hmm, not a very effective name.  A Google search didn't find any obvious
hits (even after adding the "python" qualifier, and removing "spare time"
and "spare parts" hits).  (I couldn't find it off your homepage,
either.)

Thanks for your help.  If you have any suggestions about a non-re way to
do the above, I'd be interested.
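
For reference, the kind of non-re approach I have in mind would look something like the sketch below (Python 3). It only works for the literal "foo", digits, "bar" stand-ins used in this post, not for my real, more complex terms, and the function name `foo_numbers` and its `limit` parameter are made up for illustration.

```python
def foo_numbers(s, limit=2):
    """Return up to `limit` digit runs that sit between 'foo' and 'bar'."""
    found = []
    pos = 0
    while len(found) < limit:
        start = s.find('foo', pos)
        if start == -1:
            break  # no more candidates
        i = start + len('foo')
        j = i
        # Consume the run of digits following 'foo'.
        while j < len(s) and s[j].isdigit():
            j += 1
        if j > i and s.startswith('bar', j):
            found.append(s[i:j])
            pos = j + len('bar')  # continue after this whole term
        else:
            pos = start + 1  # not a full term; keep scanning
    return found


print(foo_numbers('zzzfoo123barxxxfoo456baryyy'))  # ['123', '456']
print(foo_numbers('zzzfoo123barxxxfooyyy'))        # ['123']
```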

-John


