`re' difficulty?

François Pinard pinard at IRO.UMontreal.CA
Tue Oct 19 11:12:16 EDT 1999


"Darrell" <darrell at dorb.com> écrit:

> [.0-9]+[a-z]? matched, so there was no need to find a match on the right
> hand side ???

Traditionally (for me!), regular expressions are compiled into an automaton
in such a way that no backtracking occurs, all alternatives are kind of
logically matched in parallel.  The longuest overall match wins, at the
whole regular expression level.  This was rather ubiquitous in Unix.

Take Flex, for which all input is indeed a kind of biiig alternative.
Since each rule has an action associated, there is a supplementary rule by
which the action executed is the one for the longest match, and the one which
appears earlier in the Flex source, in case many rules match the same length.

The problem is that the representation of those efficient automatons often
require a lot of memory.  I'm no specialist, but I got that `Rx' failure
was partly revolving around the proper tuning of such explosions.  A good
part of Flex virtues is the proper care around various memory compression
schemes for representating the matching automatons.

I wrote a few regular expression matching packages myself, but none has
never been worth retaining.  It is just to say that I've always been a bit
curious about such matters.  But from the few people I know who maintained
_good_ regular expression packages, it seems that maintenance requires a lot
of investment and energy.  Backtracking matchers are usually easier to write.

Having `|' to be non-commutative is a bit Snobolish to me.  (I loved Snobol,
in its time :-).  I would be tempted to guess that this prohibits the writing
of real fast matchers, but the truth is that I really do not know, and would
have to think much harder before really asserting anything in that area.

I was a bit astonished by the `[*+?]?' constructs.  I felt they were only
meaningful to give a meaning to the precise extent of various `(...)',
since these just cannot change what is the overal longuest match anyway,
and wondered how these were implemented.  But if, as Tim writes in another
reply, the longuest match rule just does not hold, it surely changes a lot
of things, and the implementation suddenly becomes much less mysterious! :-)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard




More information about the Python-list mailing list