
""" Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. - Jamie Zawinski """ Even more true of regexps than of floating point, and even of daemon threads ;-) regex is a terrific module, incorporating many features that newer regexp implementations support. But there's a downside: if anyone was inclined to work on Python's regexp engine, the existence of that more refined alternative is a huge disincentive. Why bother? regex already does it. I would like to see regex folded into the Python core, but so far its author has shown no sign of wanting to do so. BTW, regex is also harder to provoke into exponential-time bad cases. While learning to avoid bad cases to begin with requires climbing a learning curve, it's not all that hard. People typically get in trouble by slamming in \s* and \s+ all over the place, then get surprised in a non-matching-overall case when the engine needs exponential time to work through all possible ways of segmenting whitespace. Well, that's what you told it to do. Friedl's book spells out systematic ways to avoid that. regex also supports newer ideas, like "atomic groups", that can be used explicitly to limit backtracking. BTW, there's scant general use for a "linear time" regexp engine. If it's "pure", it can require exponential _space_ to build a compiled regexp, and won't support many features people have become used to. The most gonzo modern packages are indeed "gonzo", doing a mix of linear-time algorithms and on-the-fly switching to backtracking algorithms akin to Python's and Perl's. Just supporting capturing groups (which almost all people using regexps for "parsing" use) typically requires two passes under the covers: a linear-time pass to see whether there's a match overall, and, if that succeeds, a second pass with possible backtracking, using crumbs dropped by the first pass to cut off some futile alternatives. 
For those with long memories, regexps in practical Comp Sci were typically used for lexers: simple expressions to identify tokens, which were in turn consumed by actual parsers. The Unix™ lex and yacc were very popular in their day, each using the right tool for its job.

An interesting lesson nobody wants to learn: the original major string-processing language, SNOBOL, had powerful pattern matching but no regexps. Griswold's more modern successor language, Icon, found no reason to change that. Naive regexps are both clumsy and prone to bad timing in many tasks that "should be" very easy to express. For example, "now match up to the next occurrence of 'X'". In SNOBOL and Icon, that's trivial.

With regexps, 75% of users will write ".*X", with scant understanding that it may match waaaay more than they intended. Another 20% will write ".*?X", with scant understanding that it may extend beyond _just_ "the next" X in some cases. That leaves the happy 5% who write "[^X]*X", which finally says what they intended from the start. Read the Zawinski quote again ;-)
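The three spellings really do behave differently. A quick stdlib-re sketch (strings made up for illustration):

```python
import re

s = "aXbXc"
print(re.match(r".*X", s).group())    # 'aXbX' - greedy, runs to the LAST X
print(re.match(r".*?X", s).group())   # 'aX'   - lazy, stops at the first X here...

# ...but laziness is no guarantee. Give the pattern a reason to keep
# going, and .*? happily sails past "the next" X:
t = "aXbX1"
print(re.match(r".*?X\d", t).group()) # 'aXbX1' - skipped the first X after all
print(re.match(r"[^X]*X\d", t))       # None    - cannot cross an X, so it fails honestly
```

Only the "[^X]*" form encodes "up to the next X" as a hard constraint rather than as a preference the engine may back out of.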