Regex driving me crazy...

Steven D'Aprano steven at REMOVE.THIS.cybersource.com.au
Wed Apr 7 22:51:53 EDT 2010


On Wed, 07 Apr 2010 18:03:47 -0700, Patrick Maupin wrote:

> BTW, although I find it annoying when people say "don't do that" when
> "that" is a perfectly good thing to do, and although I also find it
> annoying when people tell you what not to do without telling you what
> *to* do, 

Grant did give a perfectly good solution.


> and although I find the regex solution to this problem to be
> quite clean, the equivalent non-regex solution is not terrible, so I
> will present it as well, for your viewing pleasure:
> 
> >>> [x for x in '# 1  Short offline       Completed without error       00%'.split('  ') if x.strip()]
> ['# 1', 'Short offline', ' Completed without error', ' 00%']


This is one of the reasons we're so often suspicious of re solutions:


>>> import re
>>> from timeit import Timer
>>> s = '# 1  Short offline       Completed without error       00%'
>>> tre = Timer("re.split(' {2,}', s)", 
... "import re; from __main__ import s")
>>> tsplit = Timer("[x for x in s.split('  ') if x.strip()]", 
... "from __main__ import s")
>>>
>>> re.split(' {2,}', s) == [x for x in s.split('  ') if x.strip()]
True
>>> 
>>> 
>>> min(tre.repeat(repeat=5))
6.1224789619445801
>>> min(tsplit.repeat(repeat=5))
1.8338048458099365
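
The same comparison can be run as a script rather than at the prompt. 
What follows is a minimal sketch: the names split_re and split_plain are 
illustrative and not from the thread. timeit runs each statement 1000000 
times by default, so the figures above are totals for a million calls; 
the sketch uses a smaller number to stay quick. It also strips each 
field, because a run with an odd number of spaces otherwise leaves a 
leading space behind, as in the quoted output above.

```python
import re
from timeit import Timer

s = '# 1  Short offline       Completed without error       00%'

def split_re(line):
    # Split on any run of two or more spaces.
    return re.split(' {2,}', line)

def split_plain(line):
    # Split on double spaces, drop the empty pieces, strip what is left.
    return [x.strip() for x in line.split('  ') if x.strip()]

# Both approaches should agree on the fields.
assert split_re(s) == split_plain(s)

if __name__ == '__main__':
    for func in (split_re, split_plain):
        t = Timer(lambda: func(s))
        # number is reduced from the default of 1000000 to keep this short
        print(func.__name__, min(t.repeat(repeat=5, number=10000)))
```

Absolute timings will vary with the machine and Python version; only the 
ratio between the two lines is of interest.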


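Even when they are correct and not unreadable line-noise, regexes tend to 
be slow. And the gap gets wider, not narrower, as the input grows, from 
about 3x above to 5x or more below (note number=1000 here; the runs above 
used timeit's default of 1000000):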

>>> s *= 1000
>>> min(tre.repeat(repeat=5, number=1000))
2.3496899604797363
>>> min(tsplit.repeat(repeat=5, number=1000))
0.41538596153259277
>>>
>>> s *= 10
>>> min(tre.repeat(repeat=5, number=1000))
23.739185094833374
>>> min(tsplit.repeat(repeat=5, number=1000))
4.6444299221038818
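
To check the scaling without retyping the session, something like the 
following sketch times both statements at several input sizes and prints 
the ratio. The helper names best_time and compare, and the sizes chosen, 
are assumptions made for illustration; the two timed statements are the 
ones from the session above.

```python
import re
from timeit import Timer

base = '# 1  Short offline       Completed without error       00%'

def best_time(stmt, text, number):
    # Best of three runs, returned as seconds per call.
    t = Timer(stmt, globals={'s': text, 're': re})
    return min(t.repeat(repeat=3, number=number)) / number

def compare(sizes=(1, 10, 100), number=200):
    rows = []
    for k in sizes:
        text = base * k
        t_re = best_time("re.split(' {2,}', s)", text, number)
        t_split = best_time("[x for x in s.split('  ') if x.strip()]",
                            text, number)
        rows.append((len(text), t_re, t_split))
    return rows

if __name__ == '__main__':
    for n, t_re, t_split in compare():
        # input length, regex time, split time, regex/split ratio
        print(n, t_re, t_split, t_re / t_split)
```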


And this isn't even one of the pathological O(N**2) or O(2**N) regexes.
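
For contrast, here is a sketch of what a pathological pattern looks like. 
The pattern (a+)+$ is a textbook case of catastrophic backtracking and is 
not from this thread. Given a run of N 'a' characters followed by one 
character that cannot match, a backtracking engine such as the one in 
Python's re module tries on the order of 2**N ways of carving up the run 
before it gives up. The sizes are kept small so that it finishes quickly; 
each extra handful of characters multiplies the time.

```python
import re
import time

PATTERN = re.compile(r'(a+)+$')

def time_failed_match(n):
    # n 'a' characters followed by '!' so the overall match must fail.
    text = 'a' * n + '!'
    start = time.perf_counter()
    result = PATTERN.match(text)
    elapsed = time.perf_counter() - start
    return result, elapsed

if __name__ == '__main__':
    for n in (10, 14, 18, 20):
        result, elapsed = time_failed_match(n)
        print(n, result, elapsed)
```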

Don't get me wrong -- regexes are a useful tool. But if your first 
instinct is to write a regex, you're doing it wrong.


    [quote]
    A related problem is Perl's over-reliance on regular expressions 
    that is exaggerated by advocating regex-based solution in almost 
    all O'Reilly books. The latter until recently were the most
    authoritative source of published information about Perl. 

    While simple regular expression is a beautiful thing and can 
    simplify operations with string considerably, overcomplexity in
    regular expressions is extremely dangerous: it cannot serve as a basis
    for serious, professional programming, it is fraught with pitfalls,
    a big semantic mess as a result of outgrowing its primary purpose. 
    Diagnostic for errors in regular expressions is even weaker than 
    for the language itself and here many things just go unnoticed.
    [end quote]

http://www.softpanorama.org/Scripting/Perlbook/Ch01/place_of_perl_among_other_lang.shtml



Even Larry Wall has criticised Perl's regex culture:

http://dev.perl.org/perl6/doc/design/apo/A05.html




-- 
Steven



More information about the Python-list mailing list