leftmost longest match (of disjunctions) ; greediness of "|"

Diez B. Roggisch deets_noospaam at web.de
Tue Dec 2 08:45:35 EST 2003


Joerg Schuster wrote:

> Peter Hansen <peter at engcorp.com> writes:
> 
>> produce a longer overall match.  In other words, the "|" operator is
>> never greedy.
> 
> O.k. Thanks for pointing this out. Maybe I should have formulated my
> question differently: Is there a trick (be it dirty or not) to make
> "|" greedy in python?

I don't think so - as regexps are finite state automata, their capabilities
are limited to what these machines can do. So they can't backtrack to a
part of the disjunction that was a shorter match if the longer match
suddenly will fail. Consider this example:

rex: a|abc

string: abd

If the state-machine was greedy, it would follow the second path and fail. 

The only thing you can do to avoid is is to split your disjunctions and
search separately, and afterwards takt the longest part. That could be done
using a small grammar, that allows you to detect the disjunctions and
create separate rexes for it.

Diez




More information about the Python-list mailing list