re.match -- not greedy?

Carl Banks pavlovevidence at gmail.com
Sun Nov 19 19:11:43 EST 2006


EXI-Andrews, Jack wrote:
> the '*' and '+' don't seem to be greedy.. they will consume less in
> order to match a string:
>
> >>> import re;re.match('(a+)(ab)','aaab').groups()
> ('aa', 'ab')
>
> this is the sort of behaviour i'd expect from
>    '(a+?)(ab)'
>
> a+ should greedily consume a's at the expense of the string not matching

It's called backtracking.  If a match fails, the regexp engine
recursively unwinds some of the greedy matching and tries again.

> in real life, i am trying to match #defines:
>
> >>> re.match( '#define ([A-Za-z0-9_]+)([^(])', '#define
> abc(d)').groups()
> ('ab', 'c')
>
> i want this example to fail because the first character after a string
> of letters is a '('
> i want to match only #defines without parameters.

Probably the easiest thing to do is to attempt to match zero or one
opening parentheses, and bail out if it did end up matching the
parenthesis.

m = re.match(r"#define ([A-Za-z0-9_]+)(\(?)","#define abc(d)")
if m and m.groups(2) != "(":
    ...

(BTW, it's good practice to use a raw strings for the regular
expressions as I did.)

> so what's the definition of greedy?

It means match as much you can *while getting a match*.

Just remember regexp parsing is not a one way street: if it hits a dead
end, it'll go back and try another path.  In fact, greediness usually
doesn't affect *whether* a regexp matches; it only affects the
groupings.  I'm suddenly curious if there are any cases at all where
greediness changes whether it finds a match.

> (pls copy me on responses)

Ah, too bad you won't see it then....


Carl Banks




More information about the Python-list mailing list