[Python-bugs-list] [ python-Bugs-753711 ] Infinite Loop in RE
SourceForge.net
noreply@sourceforge.net
Mon, 16 Jun 2003 11:44:34 -0700
Bugs item #753711, was opened at 2003-06-12 19:27
Message generated for change (Comment added) made by glchapman
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=753711&group_id=5470
Category: Regular Expressions
Group: Python 2.2.2
Status: Open
Resolution: None
Priority: 5
Submitted By: Matthew Bogosian (mbogosian)
Assigned to: Fredrik Lundh (effbot)
Summary: Infinite Loop in RE
Initial Comment:
This *may* be similar to
<http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=439997>,
but I don't think the current behavior (infinite
loop/unbound execution) is correct.
Here's the python version:
- - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - -
#!/usr/bin/python
import re
pattern =
'^UPDATE\s+\w+\s+SET\s+locked_until\s*=\s*(\S+\s*)+,'
# This won't match (missing ',' at end
str = 'UPDATE arena_teams SET locked_until =
CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP)'
re.search(pattern, str, re.I)
- - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - -
When run, this script just pegs the CPU and hangs (I'm
running RedHat 8.0). Note: if I change the pattern
slightly it still doesn't work:
pattern = '^UPDATE\s+\w+\s+SET\s+locked_until\s*=\s*([^
=,]+\s*)+,'
It's not until the pattern actually matches that things
get better (with both the original and modified patterns):
# Pattern now matches (added a ',' at the end)
str = 'UPDATE arena_teams SET locked_until =
CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP),'
I tried doing the same thing in Perl:
- - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - -
#!/usr/bin/perl
'UPDATE arena_teams SET locked_until =
CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP)' =~
'/UPDATE\s+\w+\s+SET\s+locked_until\s*=\s*(\S+\s*)+,/i';
- - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - -
This of course runs just fine. Any ideas?
----------------------------------------------------------------------
Comment By: Greg Chapman (glchapman)
Date: 2003-06-16 10:44
Message:
Logged In: YES
user_id=86307
FWIW, one way to efficiently match this sort of pattern
(where you have literal text at the end of something
complex) would be to use the match once and don't
backtrack operator ('(?>pattern)'). SRE doesn't have the
(?> operator, but it can be emulated by '(?=(pattern))\1'.
So one way to avoid the exponential time would be to use
something like this:
rx = re.compile(
r'''^UPDATE\s+\w+\s+SET\s+locked_until\s*=\s*
(?=(
([^=,]+\s*)+
))\1
, #trailing literal
''',
re.I | re.X
)
By the way, it would be fairly easy to add the (?> operator to
SRE; it's almost identical to a postive lookahead assertion.
The only difference is that, upon successfully matching the
subexpression inside the parentheses, the position in the
string being matched is advanced to the end of the text
matched by the subexpression.
Also, in case anyone's interested in why Perl is able to fail so
quickly here, it appears that the Perl engine keeps track of
positions (in the target text) where something like
'(pattern)+' has already been tried and has failed, so it can
quickly fail if backtracking causes an attempt to match again
at that position (at least that's my interpretation of the
numerous "already tried at this position..." messages in the
trace from running the supplied pattern and text with "use
re 'debug';").
----------------------------------------------------------------------
Comment By: Brett Cannon (bcannon)
Date: 2003-06-14 20:28
Message:
Logged In: YES
user_id=357491
sjones is right about it taking exponential time. A regex engine
works with greedy quantifiers by sucking up all that it can and
then trying the next part of the regex and continues until it fails
or succeeds. If it fails, though, it backtracks one regex character,
gives up a character it claimed, and tries again. It does this for a
long time.
How long did you wait for the regex to complete? I bet if you left
it for a *very* long time it would complete.
----------------------------------------------------------------------
Comment By: Shannon Jones (sjones)
Date: 2003-06-13 13:42
Message:
Logged In: YES
user_id=589306
If you change your regex and sample data to the following
simpler forms, you get the same problem:
pattern = '(\S+\s*)+,'
str = 'CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP)'
If you remove some letters from str, the search does finish
but it takes several seconds. I believe the problem is that
your regex takes exponential time to execute based on the
length of str. Here is a recent python-dev post that talks
about the problem:
http://mail.python.org/pipermail/python-dev/2003-May/035916.html
The link you provided in the bug report talks about this as
well. You can also do a web search for "regular expression
exponential" to see more information. I believe that any
regex engine (that has similar features to Python's engine)
will have some regular expressions that take exponential
time or space. Perl has some cases as well (search for
"exponential" in perldoc perlre). This is just how regular
expressions are and IMHO does not indicate a bug in Python.
As far as how to fix the regular expression, I can't say.
I'm sure there is a way to "rephrase" what you want to get
it to work. Maybe try asking in the comp.lang.python newsgroup.
Good luck!
----------------------------------------------------------------------
Comment By: Shannon Jones (sjones)
Date: 2003-06-13 04:15
Message:
Logged In: YES
user_id=589306
If you change your regex and sample data to the following
simpler forms, you get the same problem:
pattern = '(\S+\s*)+,'
str = 'CAST(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP)'
If you remove some letters from str, the search does finish
but it takes several seconds. I believe the problem is that
your regex takes exponential time to execute based on the
length of str. Here is a recent python-dev post that talks
about the problem:
http://mail.python.org/pipermail/python-dev/2003-May/035916.html
The link you provided in the bug report talks about this as
well. You can also do a web search for "regular expression
exponential" to see more information. I believe that any
regex engine (that has similar features to Python's engine)
will have some regular expressions that take exponential
time or space. Perl has some cases as well (search for
"exponential" in perldoc perlre). This is just how regular
expressions are and IMHO does not indicate a bug in Python.
As far as how to fix the regular expression, I can't say.
I'm sure there is a way to "rephrase" what you want to get
it to work. Maybe try asking in the comp.lang.python newsgroup.
Good luck!
----------------------------------------------------------------------
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=753711&group_id=5470