
Has anyone seen this article? http://swtch.com/~rsc/regexp/regexp1.html Are its criticisms of Python's regex algorithm accurate? If so, might it be possible to revise Python's `re` module to use this sort of algorithm? I noticed it says that this approach doesn't work if the pattern contains backreferences, but maybe the module could at least sort of self-optimize by switching to this method when no backrefs are used.

Adam Atlas <adam@atlas.st> wrote:
Has anyone seen this article? http://swtch.com/~rsc/regexp/regexp1.html
Yes, it has been posted in the sourceforge tracker as a feature request, in python-dev, and now here.
Are its criticisms of Python's regex algorithm accurate?
In the worst-case, yes, Python's regular expression runs in O(2^n) time (where n is the length of the string you are searching). However, as stated in the sourceforge entry, and has been stated in other places, one has to write a fairly useless regular expression to get it into the O(2^n) running time. For many cases, Python's regular expression engine is quite competitive with the Thompson NFA.
It is possible, but only if someone takes the time to offer a patch. One thing to remember is that as stated in the documentation for Python's re module, certain operators are "greedy", that is, a* will pick up as many a's as it possibly can. Where as a base Thompson NFA will move on to the next state as early as possible, making a* with Thompson analagous to a*? in the Python (and others') regular expression engine. Yes, the Thompson NFA can be modified to also be greedy, but that is a particular characteristic of Python's engine that a Thompson NFA based engine will have to emulate (along with groups, named references, etc., which are a PITA for non-recursive engines). - Josiah

Josiah Carlson wrote:
Are you sure that's an inherent characteristic of a Thompson NFA? As I understood it, using a Thompson NFA is no different from building an NFA and converting it to a DFA, except it does the conversion lazily. And when using a DFA, whether it matches greedily or not depends on how you drive it. If you stop as soon as you reach the first accepting state, it's non-greedy; if you keep going until you can't go any further, it's greedy. -- Greg

Adam Atlas <adam@atlas.st> wrote:
Has anyone seen this article? http://swtch.com/~rsc/regexp/regexp1.html
Yes, it has been posted in the sourceforge tracker as a feature request, in python-dev, and now here.
Are its criticisms of Python's regex algorithm accurate?
In the worst-case, yes, Python's regular expression runs in O(2^n) time (where n is the length of the string you are searching). However, as stated in the sourceforge entry, and has been stated in other places, one has to write a fairly useless regular expression to get it into the O(2^n) running time. For many cases, Python's regular expression engine is quite competitive with the Thompson NFA.
It is possible, but only if someone takes the time to offer a patch. One thing to remember is that as stated in the documentation for Python's re module, certain operators are "greedy", that is, a* will pick up as many a's as it possibly can. Where as a base Thompson NFA will move on to the next state as early as possible, making a* with Thompson analagous to a*? in the Python (and others') regular expression engine. Yes, the Thompson NFA can be modified to also be greedy, but that is a particular characteristic of Python's engine that a Thompson NFA based engine will have to emulate (along with groups, named references, etc., which are a PITA for non-recursive engines). - Josiah

Josiah Carlson wrote:
Are you sure that's an inherent characteristic of a Thompson NFA? As I understood it, using a Thompson NFA is no different from building an NFA and converting it to a DFA, except it does the conversion lazily. And when using a DFA, whether it matches greedily or not depends on how you drive it. If you stop as soon as you reach the first accepting state, it's non-greedy; if you keep going until you can't go any further, it's greedy. -- Greg
participants (3)
-
Adam Atlas
-
Greg Ewing
-
Josiah Carlson