[Python-ideas] Regular expression algorithms

Thu Apr 12 18:04:57 CEST 2007

Adam Atlas <adam at atlas.st> wrote:
> 
> Has anyone seen this article?
> http://swtch.com/~rsc/regexp/regexp1.html

Yes, it has been posted in the sourceforge tracker as a feature request,
in python-dev, and now here.

> Are its criticisms of Python's regex algorithm accurate?

In the worst-case, yes, Python's regular expression runs in O(2^n) time
(where n is the length of the string you are searching).  However, as
stated in the sourceforge entry, and has been stated in other places,
one has to write a fairly useless regular expression to get it into the
O(2^n) running time.  For many cases, Python's regular expression engine
is quite competitive with the Thompson NFA.

> If so, might  
> it be possible to revise Python's `re` module to use this sort of  
> algorithm? I noticed it says that this approach doesn't work if the  
> pattern contains backreferences, but maybe the module could at least  
> sort of self-optimize by switching to this method when no backrefs  
> are used.

It is possible, but only if someone takes the time to offer a patch. 
One thing to remember is that as stated in the documentation for
Python's re module, certain operators are "greedy", that is, a* will
pick up as many a's as it possibly can.  Where as a base Thompson NFA
will move on to the next state as early as possible, making a* with
Thompson analagous to a*? in the Python (and others') regular expression
engine. Yes, the Thompson NFA can be modified to also be greedy, but
that is a particular characteristic of Python's engine that a Thompson
NFA based engine will have to emulate (along with groups, named
references, etc., which are a PITA for non-recursive engines).

 - Josiah