Why TERRIBLE performance of regular expressions in re module ?

Tim Peters tim_one at email.msn.com
Tue Oct 5 01:06:04 EDT 1999


[rturpin at my-deja.com]
> While debugging a CGI script, I came across the following
> strange behavior: a regular expression that seems to
> require exponential time to recognize a string.
> ...
> It doesn't take a genius to recognize this progression
> and decide not to proceed.  What puzzles me is this:
> regular expressions have a well-recognized EFFICIENT
> recognition algorithm: they are turned into state machines
> that perform LINEARLY on the length of the input string.

Python doesn't use that approach; neither does Perl, for that matter.  See
Jeffrey Friedl's book "Mastering Regular Expressions" (O'Reilly) for
details.  One hangup is that Python/Perl "regular expressions" aren't
regular -- tricks like backreferences and capturing parens go beyond what a
DFA can do.  Another hangup is that a linear-time DFA recognizer can take
exponential time to build.  A third hangup is that a DFA recognizer isn't
necessarily the fastest for a large class of regexps, although Perl is much
better than Python at exploiting the possibilities here (e.g., Perl can do a
fast Boyer-Moore search for a literal embedded substring that needs to
match, skipping-- at high speed --over vast stretches of input that can't
possibly match overall due to not containing a match for the embedded
literal).

> I don't think I am doing anything strange.  (Though I don't
> rule it out).

You probably have nested quantifiers ("+" and "*" thingies) that suffer
exponential ambiguity.  Post the regexp and someone may speed it up.  See
Friedl's book for a long and detailed course in what's needed.

> Has anyone seen this behavior,

Thousands.

> or does anyone have an explanation for it?

One or two <wink>.

yet-another-happy-regexp-user-ly y'rs  - tim






More information about the Python-list mailing list