RE: [Python-Dev] Algoritmic Complexity Attack on Python

May 31, 2003

      [Scott Crosby]
...
...
Also, I'd like to thank Tim Peters for telling me about the potential
of degradation that regular expressions may offer.
I'm acutely aware of that one because it burns people regularly.  These
aren't cases of hostile input, they're cases of innocently "erroneous"
input.  After maybe a year of experience, people using a backtracking regexp
engine usually figure out how to write a regexp that doesn't go
resource-crazy when parsing strings that *do* match.  Those are the inputs
the program expects.  But all inputs can suffers errors, and a regexp that
works well when the input matches can still go nuts trying to match a
non-matching string, consuming an exponential amount of time trying an
exponential number of futile backtracking possibilities.

Here's an unrealistic but tiny example, to get the flavor across:

"""
import re
pat = re.compile('(x+x+)+y')

from time import clock as now

for n in range(10, 100):
    print n,
    start = now()
    pat.search('x' * n + 'y')
    print now() - start,
    start = now()
    pat.search('x' * n)
    print now() - start
"""

The fixed regexp here is

    (x+x+)+y

and we search strings of the form

    xxx...xxxy      which do match
    xxx...xxx       which don't match

The matching cases take time linear in the length of the string, but it's so
fast it's hard to see the time going up at all until the string gets very
large.  The failing cases take time exponential in the length of the string.
Here's sample output:

10 0.000155885951826 0.00068891533549
11 1.59238337887e-005 0.0013736401884
12 1.76000268191e-005 0.00268777552423
13 2.43047989406e-005 0.00609379976198
14 2.51428954558e-005 0.0109438642954
15 3.4361957123e-005 0.0219815954005
16 3.10095710622e-005 0.0673058549423
17 3.26857640926e-005 0.108308050755
18 3.35238606078e-005 0.251965336328
19 3.68762466686e-005 0.334131480581
20 3.68762466685e-005 0.671073936875
21 3.60381501534e-005 1.33723327578
22 3.60381501534e-005 2.68076149449
23 3.6038150153e-005 5.37420757974
24 3.6038150153e-005 10.7601803584
25 3.52000536381e-005

I killed the program then, as I didn't want to wait 20+ seconds for the
25-character string to fail to match.

The horrid problem here is that it takes a highly educated eye to look at
that regexp and see in advance that it's going to have "fast when it
matches, possibly horrid when it doesn't match" behavior -- and this is a
dead easy case to analyze.  In a regexp that slobbers on across multiple
lines, with 5 levels of group nesting, my guess is that no more than 1
programmer in 1000 has even a vague idea how to start looking for such
problems.