[Python-Dev] regexp in Python

Bartłomiej Wołowiec b.wolowiec at students.mimuw.edu.pl
Wed Mar 21 19:42:03 CET 2007


For some time I'm interested in regular expressions and Finite State Machine. 
Recently, I saw that Python uses "Secret Labs' Regular Expression Engine", 
which very often works too slow. Its pesymistic time complexity is O(2^n), 
although other solutions, with time complexity O(n*m) ( O(n*m^2), m is the 
length of the regular expression and n is the length of the text, 
introduction to problem: http://swtch.com/~rsc/regexp/regexp1.html )

For example:

$ cat test.py
import re
import sys
if sys.argv[1:]:
        n  = int(sys.argv[1])
        regexp = 'a?'*n + 'a'*n
        print 'regexp=', regexp
        str = 'a'*n + 'b' + 'a'*n
        print 'str=', str
        print re.findall(regexp, str)

$ time python test.py 20
regexp= a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa
str= aaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaaaaaa
['aaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaa']

real    0m0.592s
user    0m0.508s
sys     0m0.028s

$ time python test.py 21
regexp= a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaa
str= aaaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaaaaaaa
['aaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaa']

real    0m1.018s
user    0m1.000s
sys     0m0.004s

$ time python test.py 22
regexp= a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaa
str= aaaaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaaaaaaaa
['aaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaa']

real    0m2.062s
user    0m1.992s
sys     0m0.028s

$ time python test.py 23
regexp= a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaa
str= aaaaaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaaaaaaaaa
['aaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaa']

real    0m4.159s
user    0m4.092s
sys     0m0.024s

Unfortuntely not for all regular expressions quick implementation may be used 
(e.g. when it uses backreferences), but for major part of regular expressions 
this implementation works much faster.

In addition to Google Summer of Code I would willingly write a faster 
implementation of regular expressions in Python. Naturally, the 
implementation would be 100% compatible with the existing regexp engine.
What do you think about my proposition?

-- 
Bartlomiej Wolowiec


More information about the Python-Dev mailing list