re.search much slower then grep on some regular expressions

Mon Jul 7 15:27:43 EDT 2008

Paddy wrote:
> On Jul 4, 1:36 pm, Peter Otten <__pete... at web.de> wrote:
>> Henning_Thornblad wrote:
>>> What can be the cause of the large difference between re.search and
>>> grep?
>> grep uses a smarter algorithm ;)
>>
>>
>>
>>> This script takes about 5 min to run on my computer:
>>> #!/usr/bin/env python
>>> import re
>>> row=""
>>> for a in range(156000):
>>>     row+="a"
>>> print re.search('[^ "=]*/',row)
>>> While doing a simple grep:
>>> grep '[^ "=]*/' input                  (input contains 156.000 a in
>>> one row)
>>> doesn't even take a second.
>>> Is this a bug in python?
>> You could call this a performance bug, but it's not common enough in real
>> code to get the necessary brain cycles from the core developers.
>> So you can either write a patch yourself or use a workaround.
>>
>> re.search('[^ "=]*/', row) if "/" in row else None
>>
>> might be good enough.
>>
>> Peter
> 
> It is not a smarter algorithm that is used in grep. Python RE's have
> more capabilities than grep RE's which need a slower, more complex
> algorithm.
> You could argue that if the costly RE features are not used then maybe
> simpler, faster algorithms should be automatically swapped in but ....

I can and do :-)

It's a major problem that regular expression parsing in python has 
exponential complexity when polynomial algorithms (for a subset of 
regexp expressions, e.g. excluding back-references) are well-known.

It rules out using python for entire classes of applications where 
regexp parsing is on the critical path.

Kris