regex (?!..) problem

Carl Banks pavlovevidence at gmail.com
Mon Oct 5 08:00:37 CEST 2009


On Oct 4, 9:34 pm, Wolfgang Rohdewald <wolfg... at rohdewald.de> wrote:
> Hi,
>
> I want to match a string only if a word (C1 in this example) appears
> at most once in it. This is what I tried:
>
> >>> re.match(r'(.*?C1)((?!.*C1))','C1b1b1b1 b3b3b3b3 C1C2C3').groups()
> ('C1b1b1b1 b3b3b3b3 C1', '')
> >>> re.match(r'(.*?C1)','C1b1b1b1 b3b3b3b3 C1C2C3').groups()
> ('C1',)
>
> but this should not have matched. Why is the .*? behaving greedy
> if followed by (?!.*C1)?

It's not.

> I would have expected that re first
> evaluates (.*?C1) before proceeding at all.

It does.

What you're not realizing is that if a regexp search comes to a dead
end, it won't simply return "no match".  Instead it'll throw away part
of the match, and backtrack to a previously-matched variable-length
subexpression, such as ".*?", and try again with a different length.

That's what happened above.  At first the group "(.*?C1)" non-greedily
matched the substring "C1", but the lookahead failed there because
another "C1" appears later in the string, so it backtracked to the
".*?" and looked for a longer match, which it found.
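
To make the backtracking visible, here is a small sketch using the
string from the question.  The variable names are mine, chosen only
for illustration.  It checks where group 1 ends relative to the
positions of "C1" in the string:

```python
import re

s = 'C1b1b1b1 b3b3b3b3 C1C2C3'
pattern = re.compile(r'(.*?C1)(?!.*C1)')

# Every index at which "C1" starts in s.
c1_positions = [m.start() for m in re.finditer('C1', s)]

m = pattern.match(s)

# After the first "C1" the lookahead fails (another "C1" follows), so the
# engine backtracks and lets ".*?" grow until the group ends at the last "C1".
group_end = m.end(1)
last_c1_end = c1_positions[-1] + len('C1')

print(c1_positions)   # start indices of each "C1"
print(m.group(1))     # the text captured by group 1
print(group_end == last_c1_end)
```

The group ends right after the last "C1", not the first one, which is
exactly the result reported in the question.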

Here's something to keep in mind: greedy versus non-greedy never
changes whether a match exists.  It only changes the order in which
candidates are tried, and therefore which match, and which group
contents, you get back.
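
As a rough illustration, assuming a pattern anchored at the end with
"$" (my choice, so that both variants must cover the whole string),
the greedy and non-greedy versions match the same text and differ only
in how it is split between the groups:

```python
import re

s = 'C1b1b1b1 b3b3b3b3 C1C2C3'

# Same pattern except for the quantifier in the first group.
greedy = re.match(r'(.*)C1(.*)$', s)
lazy = re.match(r'(.*?)C1(.*)$', s)

# Both succeed, and both cover the entire string...
same_overall_match = greedy.group(0) == lazy.group(0)

# ...but the greedy version splits at the last "C1", the lazy one at the first.
print(same_overall_match)
print(greedy.groups())
print(lazy.groups())
```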


> I also tried:
>
> >>> re.search(r'(.*?C1(?!.*C1))','C1b1b1b1 b3b3b3b3 C1C2C3C4').groups()
> ('C1b1b1b1 b3b3b3b3 C1',)
>
> with the same problem.
>
> How could this be done?

Not with patterns like these, since backtracking will simply move on
to the last C1.

How you would do this kind of depends on your overall goals, but your
first look should be toward the string methods.  If you share details
with us we can help you choose a better strategy.
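
Without knowing more, a minimal sketch with string methods might look
like the following.  The helper name appears_at_most_once is
hypothetical, and it assumes that counting plain substring occurrences
is what you want (so "C10" would also count as containing "C1"):

```python
def appears_at_most_once(text, word='C1'):
    """Return True if `word` occurs zero or one times in `text`.

    str.count counts non-overlapping substring occurrences, so this is a
    plain substring test, not a whole-word test.
    """
    return text.count(word) <= 1


print(appears_at_most_once('C1b1b1b1 b3b3b3b3 C1C2C3'))
print(appears_at_most_once('C1b1b1b1 b3b3b3b3 C2C3'))
```

If you need whole-word matching or the position of the occurrence, the
approach would have to change, which is why the details matter.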


Carl Banks


