Pathological regular expression
sjmachin at lexicon.net
Sun Apr 12 01:46:20 CEST 2009
On Apr 12, 3:40 am, Steven D'Aprano <st... at REMOVE-THIS-
> On Sat, 11 Apr 2009 08:40:03 -0700, John Machin wrote:
> >> To my mind, this is a bug in the RE engine. Is there any reason to not
> >> treat it as a bug?
> > IMHO it's not a bug -- s/hang/takes a long time to compute/
> > Just look at it: 2 + operators and 3 * operators ... It's one of those
> > "come back after lunch" REs.
> Well, it's been running now for about two and a half hours, that's a
> rather long lunch. And despite MRAB's assertion, it *cannot* be
> interrupted by ctrl-C. That means that to all intents and purposes, the
> interpreter has locked up for the duration of the calculation, which may
> be days or weeks for all I know.
If you don't know, experiment!
import re, time
_re_comments = re.compile(r'^(([^\\]+|\\.|"([^"\\]+|\\.)*")*)#.*$')
for end in ("# yadda", ""):
for start in ("", '"#"'):
for size in xrange(25):
line = start + "x" * size + end
t0 = time.clock()
result = _re_comments_nc.sub(r"\1", line)
t1 = time.clock()
print "%s (%.6f secs)" % (result, t1 - t0)
1. Runs at light speed in first two cases (genuine comemnt to be
2. Takes approx O(2**N) in the 3rd case (no # in sight)
3. Takes even longer to produce a *WRONG* result in the 4th case (# in
a string literal)
4. The RE assumes that only " can quote a string literal; what about '
''' and """?
1. if "#" not in line:
2. try timing on short pieces of input
3. Test the RE, get it correct
More information about the Python-list