[Python-bugs-list] [ python-Bugs-753711 ] Infinite Loop in RE

Mon, 16 Jun 2003 11:44:34 -0700

Bugs item #753711, was opened at 2003-06-12 19:27
Message generated for change (Comment added) made by glchapman
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=753711&group_id=5470

Category: Regular Expressions
Group: Python 2.2.2
Status: Open
Resolution: None
Priority: 5
Submitted By: Matthew Bogosian (mbogosian)
Assigned to: Fredrik Lundh (effbot)
Summary: Infinite Loop in RE

Initial Comment:
This *may* be similar to
<http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=439997>,
but I don't think the current behavior (infinite
loop/unbound execution) is correct.

Here's the python version:

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - 

#!/usr/bin/python

import re

pattern =
'^UPDATE\s+\w+\s+SET\s+locked_until\s*=\s*(\S+\s*)+,'

# This won't match (missing ',' at end
str = 'UPDATE arena_teams SET locked_until =
CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP)'

re.search(pattern, str, re.I)

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - 

When run, this script just pegs the CPU and hangs (I'm
running RedHat 8.0). Note: if I change the pattern
slightly it still doesn't work:

pattern = '^UPDATE\s+\w+\s+SET\s+locked_until\s*=\s*([^
=,]+\s*)+,'

It's not until the pattern actually matches that things
get better (with both the original and modified patterns):

# Pattern now matches (added a ',' at the end)
str = 'UPDATE arena_teams SET locked_until =
CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP),'

I tried doing the same thing in Perl:

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - 

#!/usr/bin/perl

'UPDATE arena_teams SET locked_until =
CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP)' =~
'/UPDATE\s+\w+\s+SET\s+locked_until\s*=\s*(\S+\s*)+,/i';

- - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - 

This of course runs just fine. Any ideas?

----------------------------------------------------------------------

Comment By: Greg Chapman (glchapman)
Date: 2003-06-16 10:44

Message:
Logged In: YES 
user_id=86307

FWIW, one way to efficiently match this sort of pattern 
(where you have literal text at the end of something 
complex) would be to use the match once and don't 
backtrack operator ('(?>pattern)').  SRE doesn't have the 
(?> operator, but it can be emulated by '(?=(pattern))\1'.  
So one way to avoid the exponential time would be to use 
something like this:

rx = re.compile(
   r'''^UPDATE\s+\w+\s+SET\s+locked_until\s*=\s*
   (?=(
       ([^=,]+\s*)+
   ))\1
   ,         #trailing literal
   ''', 
   re.I | re.X
)

By the way, it would be fairly easy to add the (?> operator to 
SRE; it's almost identical to a postive lookahead assertion.  
The only difference is that, upon successfully matching the 
subexpression inside the parentheses, the position in the 
string being matched is advanced to the end of the text 
matched by the subexpression.

Also, in case anyone's interested in why Perl is able to fail so 
quickly here, it appears that the Perl engine keeps track of 
positions (in the target text) where something like 
'(pattern)+' has already been tried and has failed, so it can 
quickly fail if backtracking causes an attempt to match again 
at that position (at least that's my interpretation of the 
numerous "already tried at this position..." messages in the 
trace from running the supplied pattern and text with "use 
re 'debug';").

----------------------------------------------------------------------

Comment By: Brett Cannon (bcannon)
Date: 2003-06-14 20:28

Message:
Logged In: YES 
user_id=357491

sjones is right about it taking exponential time.  A regex engine 
works with greedy quantifiers by sucking up all that it can and 
then trying the next part of the regex and continues until it fails 
or succeeds.  If it fails, though, it backtracks one regex character, 
gives up a character it claimed, and tries again.  It does this for a 
long time.

How long did you wait for the regex to complete?  I bet if you left 
it for a *very* long time it would complete.

----------------------------------------------------------------------

Comment By: Shannon Jones (sjones)
Date: 2003-06-13 13:42

Message:
Logged In: YES 
user_id=589306

If you change your regex and sample data to the following
simpler forms, you get the same problem:

pattern = '(\S+\s*)+,'
str = 'CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP)'

If you remove some letters from str, the search does finish
but it takes several seconds. I believe the problem is that
your regex takes exponential time to execute based on the
length of str. Here is a recent python-dev post that talks
about the problem:

http://mail.python.org/pipermail/python-dev/2003-May/035916.html

The link you provided in the bug report talks about this as
well. You can also do a web search for "regular expression
exponential" to see more information. I believe that any
regex engine (that has similar features to Python's engine)
will have some regular expressions that take exponential
time or space. Perl has some cases as well (search for
"exponential" in perldoc perlre).  This is just how regular
expressions are and IMHO does not indicate a bug in Python.

As far as how to fix the regular expression, I can't say.
I'm sure there is a way to "rephrase" what you want to get
it to work. Maybe try asking in the comp.lang.python newsgroup.

Good luck!

----------------------------------------------------------------------

Comment By: Shannon Jones (sjones)
Date: 2003-06-13 04:15

Message:
Logged In: YES 
user_id=589306

If you change your regex and sample data to the following
simpler forms, you get the same problem:

pattern = '(\S+\s*)+,'
str = 'CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP)'

If you remove some letters from str, the search does finish
but it takes several seconds. I believe the problem is that
your regex takes exponential time to execute based on the
length of str. Here is a recent python-dev post that talks
about the problem:

http://mail.python.org/pipermail/python-dev/2003-May/035916.html

The link you provided in the bug report talks about this as
well. You can also do a web search for "regular expression
exponential" to see more information. I believe that any
regex engine (that has similar features to Python's engine)
will have some regular expressions that take exponential
time or space. Perl has some cases as well (search for
"exponential" in perldoc perlre).  This is just how regular
expressions are and IMHO does not indicate a bug in Python.

As far as how to fix the regular expression, I can't say.
I'm sure there is a way to "rephrase" what you want to get
it to work. Maybe try asking in the comp.lang.python newsgroup.

Good luck!

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=753711&group_id=5470