[Python-Dev] regexp in Python
Bartłomiej Wołowiec
b.wolowiec at students.mimuw.edu.pl
Wed Mar 21 19:42:03 CET 2007
For some time I'm interested in regular expressions and Finite State Machine.
Recently, I saw that Python uses "Secret Labs' Regular Expression Engine",
which very often works too slow. Its pesymistic time complexity is O(2^n),
although other solutions, with time complexity O(n*m) ( O(n*m^2), m is the
length of the regular expression and n is the length of the text,
introduction to problem: http://swtch.com/~rsc/regexp/regexp1.html )
For example:
$ cat test.py
import re
import sys
if sys.argv[1:]:
n = int(sys.argv[1])
regexp = 'a?'*n + 'a'*n
print 'regexp=', regexp
str = 'a'*n + 'b' + 'a'*n
print 'str=', str
print re.findall(regexp, str)
$ time python test.py 20
regexp= a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaa
str= aaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaaaaaa
['aaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaa']
real 0m0.592s
user 0m0.508s
sys 0m0.028s
$ time python test.py 21
regexp= a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaa
str= aaaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaaaaaaa
['aaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaa']
real 0m1.018s
user 0m1.000s
sys 0m0.004s
$ time python test.py 22
regexp= a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaa
str= aaaaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaaaaaaaa
['aaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaa']
real 0m2.062s
user 0m1.992s
sys 0m0.028s
$ time python test.py 23
regexp= a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaa
str= aaaaaaaaaaaaaaaaaaaaaaabaaaaaaaaaaaaaaaaaaaaaaa
['aaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaa']
real 0m4.159s
user 0m4.092s
sys 0m0.024s
Unfortuntely not for all regular expressions quick implementation may be used
(e.g. when it uses backreferences), but for major part of regular expressions
this implementation works much faster.
In addition to Google Summer of Code I would willingly write a faster
implementation of regular expressions in Python. Naturally, the
implementation would be 100% compatible with the existing regexp engine.
What do you think about my proposition?
--
Bartlomiej Wolowiec
More information about the Python-Dev
mailing list