re.search much slower then grep on some regular expressions

Sat Jul 5 11:13:36 EDT 2008

On Jul 5, 1:54 pm, Carl Banks <pavlovevide... at gmail.com> wrote:
> I don't think you've illustrated that at all.  What you've illustrated
> is that one implementation of regexp optimizes something that another
> doesn't.  It might be due to differences in complexity; it might not.
> (Maybe there's something about PCREs that precludes the optimization
> that the default grep uses, but I'd be inclined to think not.)

It seems like an appropriate moment to point out *this* paper:

http://swtch.com/~rsc/regexp/regexp1.html

Apparently, grep and Tcl convert a regex to a finite state machine.
Matching is then *very* fast:  essentially linear time in the
length of the string being matched, even in the worst case.
Though it is possible for the size of the finite state machine
to grow exponentially with the size of the regex.

But not all PCREs can be converted to a finite state machine, so
Perl, Python, etc. use a backtracking approach, which has
exponential running time in the worst case.  In particular,
it's not possible to use a finite state machine to represent
a regular expression that contains backreferences.

Part of the problem is a lack of agreement on what
'regular expression' means.   Strictly speaking, PCREs aren't
regular expressions at all, for some values of the term
'regular expression'.  See

http://en.wikipedia.org/wiki/Regular_expression

Mark