re.search much slower then grep on some regular expressions

Wed Jul 9 08:06:33 EDT 2008

John Machin wrote:

>> Hmm, unfortunately it's still orders of magnitude slower than grep in my
>> own application that involves matching lots of strings and regexps
>> against large files (I killed it after 400 seconds, compared to 1.5 for
>> grep), and that's leaving aside the much longer compilation time (over a
>> minute).  If the matching was fast then I could possibly pickle the
>> lexer though (but it's not).
>>
> 
> Can you give us some examples of the kinds of patterns that you are
> using in practice and are slow using Python re?

Trivial stuff like:

           (Str('error in pkg_delete'), ('mtree', 'mtree')),
           (Str('filesystem was touched prior to .make install'), 
('mtree', 'mtree')),
           (Str('list of extra files and directories'), ('mtree', 'mtree')),
           (Str('list of files present before this port was installed'), 
('mtree', 'mtree')),
           (Str('list of filesystem changes from before and after'), 
('mtree', 'mtree')),

           (re('Configuration .* not supported'), ('arch', 'arch')),

           (re('(configure: error:|Script.*configure.*failed 
unexpectedly|script.*failed: here are the contents of)'),
            ('configure_error', 'configure')),
...

There are about 150 of them and I want to find which is the first match 
in a text file that ranges from a few KB up to 512MB in size.

 > How large is "large"?
> What kind of text?

It's compiler/build output.

> Instead of grep, you might like to try nrgrep ... google("nrgrep
> Navarro Raffinot"): PDF paper about it on Citeseer (if it's up),
> postscript paper and C source findable from Gonzalo Navarro's home-
> page.

Thanks, looks interesting but I don't think it is the best fit here.  I 
would like to avoid spawning hundreds of processes to process each file 
(since I have tens of thousands of them to process).

Kris