re.search much slower then grep on some regular expressions

Carl Banks pavlovevidence at gmail.com
Sat Jul 5 05:34:03 CEST 2008


On Jul 4, 4:43 pm, "Filipe Fernandes" <fernandes... at gmail.com> wrote:
> On Fri, Jul 4, 2008 at 8:36 AM, Peter Otten <__pete... at web.de> wrote:
> > Henning_Thornblad wrote:
>
> >> What can be the cause of the large difference between re.search and
> >> grep?
>
> > grep uses a smarter algorithm ;)
>
> >> This script takes about 5 min to run on my computer:
> >> #!/usr/bin/env python
> >> import re
>
> >> row=""
> >> for a in range(156000):
> >>     row+="a"
> >> print re.search('[^ "=]*/',row)
>
> >> While doing a simple grep:
> >> grep '[^ "=]*/' input                  (input contains 156.000 a in
> >> one row)
> >> doesn't even take a second.
>
> >> Is this a bug in python?
>
> > You could call this a performance bug, but it's not common enough in real
> > code to get the necessary brain cycles from the core developers.
> > So you can either write a patch yourself or use a workaround.
>
> > re.search('[^ "=]*/', row) if "/" in row else None
>
> > might be good enough.
>
> Wow... I'm rather surprised at how slow this is... using re.match
> yields much quicker results, but of course it's not quite the same as
> re.search
>
> Incidentally, if you add the '/' to "row" at the end of the string,
> re.search returns instantly with a match object.

This behavior is showing that you're getting n-squared performance;
the regexp seems to be checking 156000*(156000-1)/2 substrings for a
match.

I don't think it's possible to avoid quadratic behavior in regexps in
general, but clearly there are ways to optimize in some cases.

I'm guessing that grep builds a table of locations of individual
characters as it scans and, when the regexp exhausts the input, it
tries to backtrack to the last slash it saw, except there wasn't one
so it knows the regexp cannot be satisfied and it exits early.


> @ Peter
> I'm not versed enough in regex to tell if this is a bug or not
> (although I suspect it is),

I'm pretty sure it isn't: the regexp documentation makes no claims as
to the performance of the regexp that I'm aware of.


> but why would you say this particular
> regex isn't common enough in real code?

When re.search regexps start with things like [^...]* or .*, typically
the excluded characters are a typically found frequently in the
input.  For example, the pattern .*hello.* could be used to find a
line with hello in it, with the expectation that there are lots of
newlines.  But if there aren't any newlines the regexp wouldn't be
very useful.



Carl Banks



More information about the Python-list mailing list