[Python-ideas] Regular expression algorithms
Josiah Carlson
jcarlson at uci.edu
Thu Apr 12 18:04:57 CEST 2007
Adam Atlas <adam at atlas.st> wrote:
>
> Has anyone seen this article?
> http://swtch.com/~rsc/regexp/regexp1.html
Yes, it has been posted in the sourceforge tracker as a feature request,
in python-dev, and now here.
> Are its criticisms of Python's regex algorithm accurate?
In the worst-case, yes, Python's regular expression runs in O(2^n) time
(where n is the length of the string you are searching). However, as
stated in the sourceforge entry, and has been stated in other places,
one has to write a fairly useless regular expression to get it into the
O(2^n) running time. For many cases, Python's regular expression engine
is quite competitive with the Thompson NFA.
> If so, might
> it be possible to revise Python's `re` module to use this sort of
> algorithm? I noticed it says that this approach doesn't work if the
> pattern contains backreferences, but maybe the module could at least
> sort of self-optimize by switching to this method when no backrefs
> are used.
It is possible, but only if someone takes the time to offer a patch.
One thing to remember is that as stated in the documentation for
Python's re module, certain operators are "greedy", that is, a* will
pick up as many a's as it possibly can. Where as a base Thompson NFA
will move on to the next state as early as possible, making a* with
Thompson analagous to a*? in the Python (and others') regular expression
engine. Yes, the Thompson NFA can be modified to also be greedy, but
that is a particular characteristic of Python's engine that a Thompson
NFA based engine will have to emulate (along with groups, named
references, etc., which are a PITA for non-recursive engines).
- Josiah
More information about the Python-ideas
mailing list