[Scott Crosby]
... Also, I'd like to thank Tim Peters for telling me about the potential of degradation that regular expressions may offer.
I'm acutely aware of that one because it burns people regularly. These aren't cases of hostile input, they're cases of innocently "erroneous" input. After maybe a year of experience, people using a backtracking regexp engine usually figure out how to write a regexp that doesn't go resource-crazy when parsing strings that *do* match. Those are the inputs the program expects. But all inputs can suffers errors, and a regexp that works well when the input matches can still go nuts trying to match a non-matching string, consuming an exponential amount of time trying an exponential number of futile backtracking possibilities. Here's an unrealistic but tiny example, to get the flavor across: """ import re pat = re.compile('(x+x+)+y') from time import clock as now for n in range(10, 100): print n, start = now() pat.search('x' * n + 'y') print now() - start, start = now() pat.search('x' * n) print now() - start """ The fixed regexp here is (x+x+)+y and we search strings of the form xxx...xxxy which do match xxx...xxx which don't match The matching cases take time linear in the length of the string, but it's so fast it's hard to see the time going up at all until the string gets very large. The failing cases take time exponential in the length of the string. Here's sample output: 10 0.000155885951826 0.00068891533549 11 1.59238337887e-005 0.0013736401884 12 1.76000268191e-005 0.00268777552423 13 2.43047989406e-005 0.00609379976198 14 2.51428954558e-005 0.0109438642954 15 3.4361957123e-005 0.0219815954005 16 3.10095710622e-005 0.0673058549423 17 3.26857640926e-005 0.108308050755 18 3.35238606078e-005 0.251965336328 19 3.68762466686e-005 0.334131480581 20 3.68762466685e-005 0.671073936875 21 3.60381501534e-005 1.33723327578 22 3.60381501534e-005 2.68076149449 23 3.6038150153e-005 5.37420757974 24 3.6038150153e-005 10.7601803584 25 3.52000536381e-005 I killed the program then, as I didn't want to wait 20+ seconds for the 25-character string to fail to match. The horrid problem here is that it takes a highly educated eye to look at that regexp and see in advance that it's going to have "fast when it matches, possibly horrid when it doesn't match" behavior -- and this is a dead easy case to analyze. In a regexp that slobbers on across multiple lines, with 5 levels of group nesting, my guess is that no more than 1 programmer in 1000 has even a vague idea how to start looking for such problems.