re.search much slower than grep on some regular expressions
samwyse
samwyse at gmail.com
Wed Jul 9 15:29:00 EDT 2008
On Jul 8, 11:01 am, Kris Kennaway <k... at FreeBSD.org> wrote:
> samwyse wrote:
> > You might want to look at Plex.
> > http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/
>
> > "Another advantage of Plex is that it compiles all of the regular
> > expressions into a single DFA. Once that's done, the input can be
> > processed in a time proportional to the number of characters to be
> > scanned, and independent of the number or complexity of the regular
> > expressions. Python's existing regular expression matchers do not have
> > this property. "
> Hmm, unfortunately it's still orders of magnitude slower than grep in
> my own application, which involves matching lots of strings and
> regexps against large files (I killed it after 400 seconds, compared
> to 1.5 for grep), and that's leaving aside the much longer
> compilation time (over a minute). If the matching were fast I could
> possibly pickle the lexer to skip recompiling, but it's not.
That's funny, the compilation is almost instantaneous for me.
However, I just tested it against several files, the first containing
4875*'a' and each of the rest twice the size of the previous one. And
you're right: each doubling of the file size makes the scan take four
times as long, i.e. O(n^2). I suspect the first rule is the culprit:
Rep(AnyBut(' "=')) keeps the DFA alive all the way to end-of-file
hoping for a '/', so every one-character IGNORE token costs a scan of
the entire remaining input. At that rate, 156000*'a' would take about
seven hours. Here are my results:
compile_lexicon() took 0.0236021580595 secs
test('file-0.txt') took 24.8322969831 secs
test('file-1.txt') took 99.3956799681 secs
test('file-2.txt') took 398.349623132 secs
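Extrapolating quadratically from the file-0 timing predicts the file-1
and file-2 numbers almost exactly (a quick back-of-the-envelope check,
nothing more):

base = 24.83  # secs for 4875 chars (file-0, measured above)
for size in (9750, 19500, 156000):
    print size, 'chars ->', round(base * (size / 4875.0) ** 2, 1), 'secs'
# 9750 -> 99.3 secs, 19500 -> 397.3 secs, 156000 -> 25425.9 secs (~7 hours)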
And here's my (probably over-engineered) testbed:
from __future__ import with_statement
from os.path import exists
from timeit import Timer
from Plex import *
filename = "file-%d.txt"
def create_files(n):
    # create file-0.txt .. file-(n-1).txt, each twice the size of the last
    for x in range(0, n):
        fname = filename % x
        if not exists(fname):
            print 'creating', fname
            with open(fname, 'w') as f:
                print >>f, (4875 * 2**x) * 'a',

def compile_lexicon():
    # Plex compiles the whole lexicon into a single DFA
    global lexicon
    lexicon = Lexicon([
        (Rep(AnyBut(' "=')) + Str('/'), TEXT),   # longest-match rule
        (AnyBut('\n'), IGNORE),                  # fallback: skip one char
    ])

def test(fname):
    # scan the whole file, token by token, until EOF
    with open(fname, 'r') as f:
        scanner = Scanner(lexicon, f, fname)
        while 1:
            token = scanner.read()
            #print token
            if token[0] is None:
                break

def my_timed_test(func_name, *args):
    # time a single call of func_name(*args) and report it
    stmt = func_name + '(' + ','.join(map(repr, args)) + ')'
    t = Timer(stmt, "from __main__ import " + func_name)
    print stmt, 'took', t.timeit(1), 'secs'

if __name__ == '__main__':
    create_files(6)
    my_timed_test('compile_lexicon')
    for x in range(0, 4):
        my_timed_test('test', filename % x)
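Out of curiosity, here's the same scan done with the stdlib re module
(my own rough translation of the lexicon, so treat the pattern as an
approximation rather than Plex's exact semantics):

from __future__ import with_statement
import re

# approximate the lexicon: a run of non-' "=' chars ending in '/',
# otherwise consume a single non-newline character
token_re = re.compile(r'[^ "=]*/|[^\n]')

def test_re(fname):
    with open(fname, 'r') as f:
        data = f.read()
    pos, end = 0, len(data)
    while pos < end:
        m = token_re.match(data, pos)
        if m is None:   # unrecognized input; Plex would raise here
            break
        pos = m.end()

I'd expect this to show the same quadratic blow-up on the all-'a'
files: the first alternative scans ahead for a '/' that never comes
before falling back to the one-character match, just like the DFA.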