re.search much slower than grep on some regular expressions
samwyse
samwyse at gmail.com
Wed Jul 9 15:29:00 EDT 2008
On Jul 8, 11:01 am, Kris Kennaway <k... at FreeBSD.org> wrote:
> samwyse wrote:
> > You might want to look at Plex.
> > http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/
>
> > "Another advantage of Plex is that it compiles all of the regular
> > expressions into a single DFA. Once that's done, the input can be
> > processed in a time proportional to the number of characters to be
> > scanned, and independent of the number or complexity of the regular
> > expressions. Python's existing regular expression matchers do not have
> > this property. "
> Hmm, unfortunately it's still orders of magnitude slower than grep in
> my own application, which involves matching lots of strings and
> regexps against large files (I killed it after 400 seconds, compared
> to 1.5 for grep), and that's leaving aside the much longer
> compilation time (over a minute). If the matching were fast I could
> possibly pickle the lexer to skip recompiling, but it's not.
That's funny, the compilation is almost instantaneous for me.
However, I just tested it against several files, the first containing
4875*'a' and each of the rest twice the size of the previous one. And
you're right: each doubling of the file size makes the scan take four
times as long, i.e. O(n^2). I suspect the first rule is the culprit:
Rep(AnyBut(' "=')) keeps the DFA alive all the way to end-of-file
hoping for a '/', so every one-character IGNORE token costs a scan of
the entire remaining input. At that rate, 156000*'a' would take about
seven hours. Here are my results:
compile_lexicon() took 0.0236021580595 secs
test('file-0.txt') took 24.8322969831 secs
test('file-1.txt') took 99.3956799681 secs
test('file-2.txt') took 398.349623132 secs
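Extrapolating quadratically from the file-0 timing predicts the file-1
and file-2 numbers almost exactly (a quick back-of-the-envelope check,
nothing more):

base = 24.83  # secs for 4875 chars (file-0, measured above)
for size in (9750, 19500, 156000):
    print size, 'chars ->', round(base * (size / 4875.0) ** 2, 1), 'secs'
# 9750 -> 99.3 secs, 19500 -> 397.3 secs, 156000 -> 25425.9 secs (~7 hours)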
And here's my (probably over-engineered) testbed:
from __future__ import with_statement
from os.path import exists
from timeit import Timer
from Plex import *
filename = "file-%d.txt"
def create_files(n):
    # create file-0.txt .. file-(n-1).txt, each twice the size of the last
    for x in range(0, n):
        fname = filename % x
        if not exists(fname):
            print 'creating', fname
            with open(fname, 'w') as f:
                print >>f, (4875 * 2**x) * 'a',

def compile_lexicon():
    # Plex compiles the whole lexicon into a single DFA
    global lexicon
    lexicon = Lexicon([
        (Rep(AnyBut(' "=')) + Str('/'), TEXT),   # longest-match rule
        (AnyBut('\n'), IGNORE),                  # fallback: skip one char
    ])

def test(fname):
    # scan the whole file, token by token, until EOF
    with open(fname, 'r') as f:
        scanner = Scanner(lexicon, f, fname)
        while 1:
            token = scanner.read()
            #print token
            if token[0] is None:
                break

def my_timed_test(func_name, *args):
    # time a single call of func_name(*args) and report it
    stmt = func_name + '(' + ','.join(map(repr, args)) + ')'
    t = Timer(stmt, "from __main__ import " + func_name)
    print stmt, 'took', t.timeit(1), 'secs'

if __name__ == '__main__':
    create_files(6)
    my_timed_test('compile_lexicon')
    for x in range(0, 4):
        my_timed_test('test', filename % x)
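Out of curiosity, here's the same scan done with the stdlib re module
(my own rough translation of the lexicon, so treat the pattern as an
approximation rather than Plex's exact semantics):

from __future__ import with_statement
import re

# approximate the lexicon: a run of non-' "=' chars ending in '/',
# otherwise consume a single non-newline character
token_re = re.compile(r'[^ "=]*/|[^\n]')

def test_re(fname):
    with open(fname, 'r') as f:
        data = f.read()
    pos, end = 0, len(data)
    while pos < end:
        m = token_re.match(data, pos)
        if m is None:   # unrecognized input; Plex would raise here
            break
        pos = m.end()

I'd expect this to show the same quadratic blow-up on the all-'a'
files: the first alternative scans ahead for a '/' that never comes
before falling back to the one-character match, just like the DFA.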