[Python-bugs-list] [ python-Bugs-537533 ] Grossly inefficient re

Sun, 31 Mar 2002 16:14:07 -0800

Bugs item #537533, was opened at 2002-03-31 17:54
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=537533&group_id=5470

Category: Regular Expressions
>Group: Not a Bug
>Status: Closed
>Resolution: Invalid
Priority: 5
Submitted By: Pekka Pessi (ppessi)
>Assigned to: Tim Peters (tim_one)
Summary: Grossly inefficient re

Initial Comment:
Here is a regexp (ported from a awk script,
supposed to match C comments):

/[*][^*]*([*]+[^/][^*]+)*[*]+/

The regexp is *sometimes* *very* slow when run in
Python. Also, while replacing the regexp with
sub(), the Python interpreter is in deep recursion
in sre and does not respond to any signals.

I *think* this should be a O(N) regexp.

An example script demonstrating the problem is
attached. I have an example where adding two
characters to the input makes the re.sub() some
250000 times longer to execute.

There was no noticeable difference between sre and
pre modules.

I have a RedHat 7.2 box with a 600 MHz
PentiumIII, I'm using is Python 2.2 Linux RPMs
by Sean Reifschneider <jafo-rpms@tummy.com>:

Python 2.2 (#1, Dec 23 2001, 09:30:32) 
[GCC 2.96 20000731 (Red Hat Linux 7.1 2.96-98)] on linux2

Same problem applies to the Python 1.5.2 provided by
Redhat:
Python 1.5.2 (#1, Jul  5 2001, 03:02:19)  [GCC 2.96
20000731 (Red Hat Linux 7.1 2 on linux-i386
Copyright 1991-1995 Stichting Mathematisch Centrum,
Amsterdam

----------------------------------------------------------------------

>Comment By: Tim Peters (tim_one)
Date: 2002-03-31 19:14

Message:
Logged In: YES 
user_id=31435

No, it's not necessarily linear-time when it fails to 
match, as it does so fail in your second string.  Then you 
have 17 instances of "**" stacked up via your [*]+, and it 
has to backtrack through 2**17 = 128K possibilities (at 
each of the 17 points, whether to match one or two *s).  It 
will run enormously faster in fails-to-match cases if you 
simply change [*]+ to [*].  The failing case will run just 
as fast as the successful one then.

If you've read Friedl, the general unrolling pattern is

normal* (special normal*)*

Changing special to special+ instead can be a disaster in 
failing cases, as you've discovered here.  That's why 
Friedl didn't write it "special+" to begin with <wink>.

----------------------------------------------------------------------

You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=537533&group_id=5470