re.search much slower then grep on some regular expressions

Sebastian "lunar" Wiesner basti.wiesner at gmx.net
Sat Jul 5 12:44:05 CEST 2008


Carl Banks <pavlovevidence at gmail.com>:

> On Jul 5, 4:12 am, "Sebastian \"lunar\" Wiesner"
> <basti.wies... at gmx.net> wrote:
>> Paddy <paddy3... at googlemail.com>:
>>
>>
>>
>> > On Jul 4, 1:36 pm, Peter Otten <__pete... at web.de> wrote:
>> >> Henning_Thornblad wrote:
>> >> > What can be the cause of the large difference between re.search and
>> >> > grep?
>>
>> >> grep uses a smarter algorithm ;)
>>
>> >> > This script takes about 5 min to run on my computer:
>> >> > #!/usr/bin/env python
>> >> > import re
>>
>> >> > row=""
>> >> > for a in range(156000):
>> >> > row+="a"
>> >> > print re.search('[^ "=]*/',row)
>>
>> >> > While doing a simple grep:
>> >> > grep '[^ "=]*/' input                  (input contains 156.000 a in
>> >> > one row)
>> >> > doesn't even take a second.
>>
>> >> > Is this a bug in python?
>>
>> >> You could call this a performance bug, but it's not common enough in
>> >> real code to get the necessary brain cycles from the core developers.
>> >> So you can either write a patch yourself or use a workaround.
>>
>> >> re.search('[^ "=]*/', row) if "/" in row else None
>>
>> >> might be good enough.
>>
>> >> Peter
>>
>> > It is not a smarter algorithm that is used in grep. Python RE's have
>> > more capabilities than grep RE's which need a slower, more complex
>> > algorithm.
>>
>> FWIW, grep itself can confirm this statement.  The following command
>> roughly takes as long as Python's re.search:
>>
>> # grep -P '[^ "=]*/' input
>>
>> -P tells grep to use real perl-compatible regular expressions.
> 
> This confirms that a particular engine might not be optimized for it,
> but it's not necessarily a reflection that the engine is more complex.

My posting wasn't intended to reflect the differences in complexity between
normal GNU grep expressions (which are basically extended POSIX
expressions) and perl-compatible expressions.  The latter just _are_ more
complex, having additional features like look aheads or non-greedy
qualifiers.

I just wanted to illustrate, that the speed of the given search is somehow
related to the complexity of the engine.  

Btw, other pcre implementation are as slow as Python or "grep -P".  I tried
a sample C++-code using pcre++ (a wrapper around libpcre) and saw it
running equally long.


-- 
Freedom is always the freedom of dissenters.
                                      (Rosa Luxemburg)



More information about the Python-list mailing list