much slower then grep on some regular expressions

John Machin sjmachin at
Thu Jul 10 04:20:41 CEST 2008

On Jul 9, 10:06 pm, Kris Kennaway <k... at> wrote:
> John Machin wrote:
> >> Hmm, unfortunately it's still orders of magnitude slower than grep in my
> >> own application that involves matching lots of strings and regexps
> >> against large files (I killed it after 400 seconds, compared to 1.5 for
> >> grep), and that's leaving aside the much longer compilation time (over a
> >> minute).  If the matching was fast then I could possibly pickle the
> >> lexer though (but it's not).
> > Can you give us some examples of the kinds of patterns that you are
> > using in practice and are slow using Python re?
> Trivial stuff like:
>            (Str('error in pkg_delete'), ('mtree', 'mtree')),
>            (Str('filesystem was touched prior to .make install'),
> ('mtree', 'mtree')),
>            (Str('list of extra files and directories'), ('mtree', 'mtree')),
>            (Str('list of files present before this port was installed'),
> ('mtree', 'mtree')),
>            (Str('list of filesystem changes from before and after'),
> ('mtree', 'mtree')),
>            (re('Configuration .* not supported'), ('arch', 'arch')),
>            (re('(configure: error:|Script.*configure.*failed
> unexpectedly|script.*failed: here are the contents of)'),
>             ('configure_error', 'configure')),
> ...
> There are about 150 of them and I want to find which is the first match
> in a text file that ranges from a few KB up to 512MB in size.
>  > How large is "large"?
> > What kind of text?
> It's compiler/build output.
> > Instead of grep, you might like to try nrgrep ... google("nrgrep
> > Navarro Raffinot"): PDF paper about it on Citeseer (if it's up),
> > postscript paper and C source findable from Gonzalo Navarro's home-
> > page.
> Thanks, looks interesting but I don't think it is the best fit here.  I
> would like to avoid spawning hundreds of processes to process each file
> (since I have tens of thousands of them to process).

Uh-huh ... try this, then:

You could use this to find the "Str" cases and the prefixes of the
"re" cases (which seem to be no more complicated than 'foo.*bar.*zot')
and use something slower like Python's re to search the remainder of
the line for 'bar.*zot'.


More information about the Python-list mailing list