re.search much slower then grep on some regular expressions

Sebastian "lunar" Wiesner basti.wiesner at gmx.net
Sun Jul 6 20:13:37 CEST 2008


Mark Wooding <mdw at distorted.org.uk>:

> Sebastian "lunar" Wiesner <basti.wiesner at gmx.net> wrote:
> 
>> I just wanted to illustrate, that the speed of the given search is
>> somehow related to the complexity of the engine.
>>
>> Btw, other pcre implementation are as slow as Python or "grep -P".  I
>> tried a sample C++-code using pcre++ (a wrapper around libpcre) and saw
>> it running equally long.
> 
> So some other implementations are equally poor.  I note that Perl itself
> handles this case very quickly, as does Edi Weitz's CL-PPCRE library.

I don't know about the latter, but perl doesn't use any smart algorithm, it
just heavily relies on memoization to gain speed.

This makes perl perform poorly in other situations, as mentioned in the
paper cited by Mark Dickinson:

# perl -e '("a" x 100000) =~ /^(ab?)*$/;'
zsh: segmentation fault  perl -e '("a" x 100000) =~ /^(ab?)*$/;'

In Python on the other, this pattern works fine, at the cost of performance
losses on other patterns.

It'd be interesting to know, how CL-PPCRE performs here (I don't know this
library).

> Yes, Perl-compatible `regular expressions' are more complicated than
> POSIX extended regular expressions; but that doesn't mean that you have
> to implement them in a dumb way. Indeed, it stands to reason that
> expressions describing truly regular languages can be matched using the 
> same old algorithms that grep uses. 

I completely agree.  I'd just believe, that the combination of some finite
state machine for "classic" expressions with some backtracking code is
terribly hard to implement.  But I'm not an expert in this, probably some
libraries out there already do this.  In this case, it'd be time to give
pythons re engine a more sophisticated implementation ;)

-- 
Freedom is always the freedom of dissenters.
                                      (Rosa Luxemburg)



More information about the Python-list mailing list