
""" Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. - Jamie Zawinski """ Even more true of regexps than of floating point, and even of daemon threads ;-) regex is a terrific module, incorporating many features that newer regexp implementations support. But there's a downside: if anyone was inclined to work on Python's regexp engine, the existence of that more refined alternative is a huge disincentive. Why bother? regex already does it. I would like to see regex folded into the Python core, but so far its author has shown no sign of wanting to do so. BTW, regex is also harder to provoke into exponential-time bad cases. While learning to avoid bad cases to begin with requires climbing a learning curve, it's not all that hard. People typically get in trouble by slamming in \s* and \s+ all over the place, then get surprised in a non-matching-overall case when the engine needs exponential time to work through all possible ways of segmenting whitespace. Well, that's what you told it to do. Friedl's book spells out systematic ways to avoid that. regex also supports newer ideas, like "atomic groups", that can be used explicitly to limit backtracking. BTW, there's scant general use for a "linear time" regexp engine. If it's "pure", it can require exponential _space_ to build a compiled regexp, and won't support many features people have become used to. The most gonzo modern packages are indeed "gonzo", doing a mix of linear-time algorithms and on-the-fly switching to backtracking algorithms akin to Python's and Perl's. Just supporting capturing groups (which almost all people using regexps for "parsing" use) typically requires two passes under the covers: a linear-time pass to see whether there's a match overall, and, if that succeeds, a second pass with possible backtracking, using crumbs dropped by the first pass to cut off some futile alternatives. 
For those with long memories, regexps in practical Comp Sci were typically used for lexers: simple expressions to identify tokens, which were in turn consumed by actual parsers. The Unix™ lex and yacc were very popular in their day, each using the right tool for its job.

An interesting lesson nobody wants to learn: the original major string-processing language, SNOBOL, had powerful pattern matching but no regexps. Griswold's more modern successor language, Icon, found no reason to change that. Naive regexps are both clumsy and prone to bad timing in many tasks that "should be" very easy to express. For example, "now match up to the next occurrence of 'X'". In SNOBOL and Icon, that's trivial.

With regexps, 75% of users will write ".*X", with scant understanding that it may match waaaay more than they intended. Another 20% will write ".*?X", with scant understanding that it may extend beyond _just_ "the next" X in some cases. That leaves the happy 5% who write "[^X]*X", which finally says what they intended from the start. Read the Zawinski quote again ;-)
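The three spellings really do behave differently. A quick stdlib-re sketch (strings made up for illustration):

```python
import re

s = "aXbXc"
print(re.match(r".*X", s).group())    # 'aXbX' - greedy, runs to the LAST X
print(re.match(r".*?X", s).group())   # 'aX'   - lazy, stops at the first X here...

# ...but laziness is no guarantee. Give the pattern a reason to keep
# going, and .*? happily sails past "the next" X:
t = "aXbX1"
print(re.match(r".*?X\d", t).group()) # 'aXbX1' - skipped the first X after all
print(re.match(r"[^X]*X\d", t))       # None    - cannot cross an X, so it fails honestly
```

Only the "[^X]*" form encodes "up to the next X" as a hard constraint rather than as a preference the engine may back out of.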