# The Regex Story

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Fri Apr 9 04:59:43 EDT 2010

```
On Fri, 09 Apr 2010 14:48:22 +1000, Lie Ryan wrote:

> On 04/09/10 12:32, Dotan Cohen wrote:
>>> Regexes do have their uses. It's a case of knowing when they are the
>>> best approach and when they aren't.
>>
>> Agreed. The problems begin when the "when they aren't" is not
>> recognised.
>
> But problems also arise when people are suggesting overly complex
> series of built-in functions for what is better handled by regex.

What defines "overly complex"?

For some reason, people seem to have the idea that pattern matching of
strings must be a single expression, no matter how complicated the
pattern they're trying to match. If we have a complicated task to do in
almost any other field, we don't hesitate to write a function to do it,
or even multiple functions: we break our code up into small,
understandable, testable pieces. We recognise that a five-line function
may very well be less complex than a one-line expression that does the
same thing. But if it's a string pattern matching task, we somehow become
resistant to the idea of writing a function and treat one-line
expressions as "simpler", no matter how convoluted they become.

It's as if we decided that every maths problem had to be solved by a
single expression, no matter how complex, and invented a painfully terse
language unrelated to normal maths syntax for doing so:

# Calculate the roots of sin**2(3*x-y):
result = me.compile("{^g.?+*y:h}|\Y^r&(?P:2+)|\w+(x&y)|[?#\s]").solve()

That's not to say that regexes aren't useful, or that they don't have
advantages. They are well-studied from a theoretical basis. You don't
have to re-invent the wheel: the re module provides useful pattern
matching functionality with quite good performance.

One disadvantage is that you have to learn an entire new language, a
language which is painfully terse and obfuscated, with virtually no
support for debugging. Larry Wall has criticised the Perl regex syntax on
a number of grounds:

* things which look similar often are very different;
* things which are commonly needed are long and verbose, while things
  which are rarely needed are short;
* too much reliance on too few metacharacters;
* the default is to treat whitespace around tokens as significant,
  instead of defaulting to verbose-mode for readability;
* overuse of parentheses;
* difficulty working with non-ASCII data;
* insufficient abstraction;
* even though regexes are source code in a regular expression language,
  they're treated as mere strings, even in Perl;

and many others.

http://dev.perl.org/perl6/doc/design/apo/A05.html

As programming languages go, regular expressions -- even Perl's regular
expressions on steroids -- are particularly low-level. They are the assembly
language of pattern matching, compared to languages like Prolog, SNOBOL
and Icon. These languages use patterns equivalent in power to Backus-Naur
Form grammars, or context-free grammars, much more powerful and readable
than regular expressions.

But in any case, not all text processing problems are pattern-matching
problems, and even those that are don't necessarily require the 30lb
sledgehammer of regular expressions.

I find it interesting to note that there is such a thing as "regex
culture", as Larry Wall describes it. There seems to be a sort of
programmers' machismo about solving problems via regexes, even when
they're not the right tool for the job, and in the fewest number of
characters possible. I think regexes have a bad reputation because of
regex culture, and not just within Python circles either:

http://echochamber.me/viewtopic.php?f=11&t=57405

For the record, I'm not talking about "Because It's There" regexes like
this 6343-character monster:

or these:

http://mail.pm.org/pipermail/athens-pm/2003-January/000033.html
http://blog.sigfpe.com/2007/02/modular-arithmetic-with-regular.html

The fact that these exist at all is amazing and wonderful. And yes, I
admire the Obfuscated C and Underhanded C contests too :)

--
Steven

```
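
To make the post's comparison concrete, here is a minimal sketch of one task written three ways: a terse one-line regex, the same regex in verbose mode (the readability default the post says regexes lack), and a short function built from ordinary string methods. The task (checking a dotted-quad IPv4 address) and the names `IPV4_TERSE`, `IPV4_VERBOSE` and `is_ipv4` are assumptions chosen for illustration; none of them appear in the original message.

```python
import re

# Terse form: the whole pattern on one line.
IPV4_TERSE = re.compile(
    r"(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
)

# The same pattern in verbose mode: whitespace is ignored and
# comments are allowed, so the structure is visible.
IPV4_VERBOSE = re.compile(
    r"""
    (?:                 # three octets, each followed by a dot
        (?: 25[0-5] | 2[0-4][0-9] | 1[0-9][0-9] | [1-9]?[0-9] )
        \.
    ){3}
    (?: 25[0-5] | 2[0-4][0-9] | 1[0-9][0-9] | [1-9]?[0-9] )   # final octet
    """,
    re.VERBOSE,
)


def is_ipv4(text):
    """Hypothetical helper: same check using plain string methods."""
    parts = text.split(".")
    if len(parts) != 4:
        return False
    for part in parts:
        # isdigit() also accepts some non-ASCII digits, so require ASCII too.
        if not (part.isascii() and part.isdigit()):
            return False
        # Reject leading zeros such as "01", as the regexes above do.
        if len(part) > 1 and part[0] == "0":
            return False
        if int(part) > 255:
            return False
    return True


if __name__ == "__main__":
    for candidate in ["192.168.0.1", "256.1.1.1", "1.2.3", "01.2.3.4"]:
        print(
            candidate,
            bool(IPV4_TERSE.fullmatch(candidate)),
            bool(IPV4_VERBOSE.fullmatch(candidate)),
            is_ipv4(candidate),
        )
```

The function is longer than either regex, but each rule sits on its own line and can be tested or changed independently, which is the trade-off the post describes. The sketch uses `fullmatch()` rather than `^...$` anchors because `$` would also match just before a trailing newline.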