re.search much slower than grep on some regular expressions
kris at FreeBSD.org
Wed Jul 9 21:46:50 CEST 2008
> On Jul 8, 11:01 am, Kris Kennaway <k... at FreeBSD.org> wrote:
>> samwyse wrote:
>>> You might want to look at Plex.
>>> "Another advantage of Plex is that it compiles all of the regular
>>> expressions into a single DFA. Once that's done, the input can be
>>> processed in a time proportional to the number of characters to be
>>> scanned, and independent of the number or complexity of the regular
>>> expressions. Python's existing regular expression matchers do not have
>>> this property. "
>> Hmm, unfortunately it's still orders of magnitude slower than grep in my
>> own application, which involves matching lots of strings and regexps
>> against large files (I killed it after 400 seconds, compared to 1.5 for
>> grep), and that's leaving aside the much longer compilation time (over a
>> minute). If the matching were fast I could possibly pickle the lexer,
>> but it's not.
> That's funny, the compilation is almost instantaneous for me.
My lexicon was quite a bit bigger, containing about 150 strings and regexps.
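For the curious, it was roughly along the lines of the sketch below. This
is illustrative only: the token names and input file are made up, the real
lexicon had ~150 entries, and I'm assuming Plex's usual API here (Lexicon,
Scanner, and Scanner.read() returning (None, '') at end of input).

    # Sketch of a Plex lexicon mixing literal strings and regexp-style
    # patterns; Plex compiles the whole list into a single DFA.
    from Plex import Lexicon, Scanner, Str, Rep, Rep1, Range, Any, TEXT, IGNORE

    letter = Range("azAZ")
    digit = Range("09")
    lexicon = Lexicon([
        (Str("ERROR"), TEXT),                    # literal string
        (Str("WARNING"), TEXT),                  # ...~150 of these in my case
        (letter + Rep(letter | digit), TEXT),    # regexp-style pattern
        (Rep1(digit), TEXT),
        (Any(" \t\n"), IGNORE),
    ])

    # Scanning a file:
    f = open("input.txt", "r")
    scanner = Scanner(lexicon, f, "input.txt")
    while 1:
        token, text = scanner.read()
        if token is None:
            break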
> However, I just tested it against several files, the first containing
> 4875*'a' and the rest each twice the size of the previous. And you're
> right: for each doubling of the file size, the match takes four times
> as long, meaning O(n^2). 156000*'a' would probably take 8 hours.
> Here are my results:
The docs say it is supposed to be linear in the file size ;-) ;-(
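For anyone who wants to reproduce the doubling test, something like the
following sketch should do. The one-pattern lexicon and the sizes are
illustrative, and it again assumes Scanner.read() returns (None, '') at
end of input. A linear scanner would roughly double the time at each
step; quadrupling indicates O(n^2).

    # Time a full scan over inputs of doubling size.
    import time
    from StringIO import StringIO
    from Plex import Lexicon, Scanner, Rep1, Range, TEXT

    lexicon = Lexicon([(Rep1(Range("az")), TEXT)])  # illustrative lexicon

    size = 4875
    for i in range(6):
        scanner = Scanner(lexicon, StringIO(size * "a"), "test")
        start = time.time()
        while scanner.read()[0] is not None:
            pass
        print size, time.time() - start
        size *= 2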