The Regex Story

Steven D'Aprano steve at
Fri Apr 9 10:59:43 CEST 2010

On Fri, 09 Apr 2010 14:48:22 +1000, Lie Ryan wrote:

> On 04/09/10 12:32, Dotan Cohen wrote:
>>> Regexes do have their uses. It's a case of knowing when they are the
>>> best approach and when they aren't.
>> Agreed. The problems begin when the "when they aren't" is not
>> recognised.
> But problems also arise when people are suggesting overly complex
> series of built-in functions for what is better handled by regex.

What defines "overly complex"?

For some reason, people seem to have the idea that pattern matching of 
strings must be a single expression, no matter how complicated the 
pattern they're trying to match. If we have a complicated task to do in 
almost any other field, we don't hesitate to write a function to do it, 
or even multiple functions: we break our code up into small, 
understandable, testable pieces. We recognise that a five-line function 
may very well be less complex than a one-line expression that does the 
same thing. But if it's a string pattern matching task, we somehow become 
resistant to the idea of writing a function and treat one-line 
expressions as "simpler", no matter how convoluted they become.
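To make that concrete, here is a rough sketch of one task done both ways: checking whether a string is a dotted-quad IPv4 address. The task, the names is_ipv4_regex and is_ipv4_function, and the rule that leading zeros are rejected are all assumptions chosen for illustration, not something taken from the thread.

```python
import re

# One-expression version: each octet has to be spelled out as an
# alternation of digit patterns, because a regex cannot compare numbers.
IPV4_RE = re.compile(
    r"(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
)


def is_ipv4_regex(text):
    """Return True if text is a dotted quad, using a single regex."""
    return IPV4_RE.fullmatch(text) is not None


def is_ipv4_function(text):
    """Return True if text is a dotted quad, using plain string methods."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # Each part must be one or more ASCII digits.
        if not (part.isascii() and part.isdigit()):
            return False
        # Reject leading zeros such as "01" (an assumption of this sketch).
        if len(part) > 1 and part[0] == "0":
            return False
        # The range check is ordinary arithmetic here.
        if int(part) > 255:
            return False
    return True
```

The function is several times longer, but each rule sits on its own line where it can be read, tested and changed separately, which is arguably the sense in which it is less complex.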

It's as if we decided that every maths problem had to be solved by a 
single expression, no matter how complex, and invented a painfully terse 
language unrelated to normal maths syntax for doing so:

# Calculate the roots of sin**2(3*x-y):
result = me.compile("{^g.?+*y:h}|\Y^r&(?P:2+)|\w+(x&y)|[?#\s]").solve()

That's not to say that regexes aren't useful, or that they don't have 
advantages. They are well-studied from a theoretical basis. You don't 
have to re-invent the wheel: the re module provides useful pattern 
matching functionality with quite good performance.

One disadvantage is that you have to learn an entire new language, a 
language which is painfully terse and obfuscated, with virtually no 
support for debugging. Larry Wall has criticised the Perl regex syntax on 
a number of grounds:

* things which look similar often are very different;
* things which are commonly needed are long and verbose, while things 
which are rarely needed are short;
* too much reliance on too few metacharacters;
* the default is to treat whitespace around tokens as significant, 
instead of defaulting to verbose-mode for readability;
* overuse of parentheses;
* difficulty working with non-ASCII data;
* insufficient abstraction;
* even though regexes are source code in a regular expression language, 
they're treated as mere strings, even in Perl;

and many others.
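Python's re module shares the whitespace-is-significant default, but the re.VERBOSE flag lets you opt in to a spaced-out, commented layout. A small sketch, using a made-up date pattern (the names TERSE and VERBOSE and the YYYY-MM-DD format are purely illustrative):

```python
import re

# Default mode: every character, including whitespace, is significant.
TERSE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Verbose mode: whitespace outside character classes is ignored and
# "#" starts a comment, so the pattern can be laid out and annotated.
VERBOSE = re.compile(
    r"""
    (?P<year>  \d{4} )   # four-digit year
    -
    (?P<month> \d{2} )   # two-digit month
    -
    (?P<day>   \d{2} )   # two-digit day
    """,
    re.VERBOSE,
)

match = VERBOSE.fullmatch("2010-04-09")
```

Both patterns accept the same strings; the verbose one adds named groups, so match.group("year") can be used in place of a positional index.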

As programming languages go, regular expressions -- even Perl's regular 
expressions on steroids -- are particularly low-level. They're the assembly 
language of pattern matching, compared to languages like Prolog, SNOBOL 
and Icon. These languages use patterns equivalent in power to Backus-Naur 
Form grammars, or context-free grammars, much more powerful and readable 
than regular expressions.
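One standard illustration of that gap in power: arbitrarily nested balanced parentheses form a context-free language that is not regular, so a regular expression in the strict formal sense cannot recognise it, and as far as I know Python's re module has no recursion extension to work around that. A counter handles it in a few lines. The function name is_balanced is mine, and this is only a sketch:

```python
def is_balanced(text):
    """Return True if every "(" in text has a matching ")" after it."""
    depth = 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            # A closing parenthesis with nothing open can never be fixed later.
            if depth < 0:
                return False
    # Anything still open at the end is unbalanced.
    return depth == 0
```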

But in any case, not all text processing problems are pattern-matching 
problems, and even those that are don't necessarily require the 30lb 
sledgehammer of regular expressions.
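As a small example of the lighter tools, here is a sketch that splits "key = value" lines using nothing but string methods. The line format and the name parse_setting are hypothetical, picked only to show the idea:

```python
def parse_setting(line):
    """Split a "key = value" line into a (key, value) tuple.

    Returns None for blank lines, "#" comments, and lines with no "=".
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # partition() splits on the first "=" only, so the value may contain "=".
    key, sep, value = line.partition("=")
    if not sep:
        return None
    return key.strip(), value.strip()
```

No pattern language is involved at all: strip, startswith and partition cover the whole job.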

I find it interesting to note that there is such a thing as "regex 
culture", as Larry Wall describes it. There seems to be a sort of 
programmers' machismo about solving problems via regexes, even when 
they're not the right tool for the job, and in the fewest characters 
possible. I think regexes have a bad reputation because of 
regex culture, and not just within Python circles either:

For the record, I'm not talking about "Because It's There" regexes like 
this 6343-character monster:

or these:

The fact that these exist at all is amazing and wonderful. And yes, I 
admire the Obfuscated C and Underhanded C contests too :)

