re.search much slower then grep on some regular expressions
Kris Kennaway
kris at FreeBSD.org
Mon Jul 7 15:27:43 EDT 2008
Paddy wrote:
> On Jul 4, 1:36 pm, Peter Otten <__pete... at web.de> wrote:
>> Henning_Thornblad wrote:
>>> What can be the cause of the large difference between re.search and
>>> grep?
>> grep uses a smarter algorithm ;)
>>
>>
>>
>>> This script takes about 5 min to run on my computer:
>>> #!/usr/bin/env python
>>> import re
>>> row=""
>>> for a in range(156000):
>>> row+="a"
>>> print re.search('[^ "=]*/',row)
>>> While doing a simple grep:
>>> grep '[^ "=]*/' input (input contains 156.000 a in
>>> one row)
>>> doesn't even take a second.
>>> Is this a bug in python?
>> You could call this a performance bug, but it's not common enough in real
>> code to get the necessary brain cycles from the core developers.
>> So you can either write a patch yourself or use a workaround.
>>
>> re.search('[^ "=]*/', row) if "/" in row else None
>>
>> might be good enough.
>>
>> Peter
>
> It is not a smarter algorithm that is used in grep. Python RE's have
> more capabilities than grep RE's which need a slower, more complex
> algorithm.
> You could argue that if the costly RE features are not used then maybe
> simpler, faster algorithms should be automatically swapped in but ....
I can and do :-)
It's a major problem that regular expression parsing in python has
exponential complexity when polynomial algorithms (for a subset of
regexp expressions, e.g. excluding back-references) are well-known.
It rules out using python for entire classes of applications where
regexp parsing is on the critical path.
Kris
More information about the Python-list
mailing list