re.search much slower then grep on some regular expressions

Wed Jul 9 02:17:25 EDT 2008

On Jul 8, 2:48 am, John Machin <sjmac... at lexicon.net> wrote:
> On Jul 8, 2:51 am, Henning Thornblad <Henning.Thornb... at gmail.com>
> wrote:
>
>
>
> > When trying to find an alternative way of solving my problem i found
> > that running this script:
>
> > #!/usr/bin/env python
>
> > import re
>
> > row=""
> > for a in range(156000):
> >     row+="a"
> > print "How many, dude?"
> > print re.search('/[^ "=]*',row)  (the / has moved)
>
> > wouldn't take even a second (The re.search part of course)
>
> > This is for me very strange since this,
> > in theory, is the same problem as before.
>
> In theory. In practice, it is the same problem *iff* you start at the
> end of the text and work backwards.
>
> Your original pattern can be characterised as X*Y, where X and Y are
> character classes; your new one is YX*. You are asking each to search
> a text that contains many Xs and no Ys.
>
> The second pattern will cause a naive search engine to examine each
> character in the text (is it a Y?) as a candidate for the start of a
> successful match; each test fails, and the whole expedition has cost
> O(N).
>
> OTOH, the first pattern will start at offset 0, cheerfully cruising
> past all those lovely Xs, but failing (no Y) at the end of the text.
> It will then think that it's been too greedy, reduce the matched span
> of X* from N characters to N-1, fail to find Y, N-2, fail, ... so
> that's O(N) just for the first trial starting from offset 0. Being
> naive, it will try again starting from offset 1, 2, 3, ... an O(N**2)
> process. If Y were instead something more complicated, it would be
> worse than O(N**2).
>
> If you were to tell us what you are actually trying to do, we could
> probably help you with a faster method.
>
> BTW, if the text is 'aaa///bbb///ccc', your original search will find
> 'aaa///bbb///'; if you want it to find 'aaa/', the pattern should be:
>      '[^ "=/]*/'
>
> Cheers,
> John

What we had was a regex that searched textfiles, row for row,
for an absolute path starting/ending with  ", = or blank space.

The path have a known signature which is pretty unique and the regex
looked like this:
[^ ="]*/[^ ="]*signature[^ =]*/

So we didn't really see any performance loss until we got an textfile
with some really long rows, one reaching 156,000 characters. We
removed the cause of these long rows.
But also decided to tune the regex making it more scalable.

I have though of two possible soultions:
1. Invert [^ ="] so that it becomes [alot of charcters you could find
in a path]
2. Add [ ="]:  [ ="][^ ="]*/[^ ="]*signature[^ =]*/[ ="] and just trim
them later
They both work good performance wise but i can't figure out which one
is the less ugly.

Anyway, I'm glad I asked the question. I have learned more about regex
then i ever thought i'd need :)

Many Thanks
Henning Thornblad