[Python-Dev] re performance

Vlastimil Brom vlastimil.brom at gmail.com
Thu Jan 26 16:33:57 EST 2017


2017-01-26 22:13 GMT+01:00 Sven R. Kunze <srkunze at mail.de>:
> Hi folks,
>
> I recently refreshed regular expressions theoretical basics *indulging in
> reminiscences* So, I read https://swtch.com/~rsc/regexp/regexp1.html
>
> However, reaching the chart in the lower third of the article, I saw Python
> 2.4 measured against a naive Thompson matching implementation. And I was
> surprised about how bad it performed compared to an unoptimized version of
> an older than dirt algorithm.
>
> So, I decided to give it a try with Python2.7 and Python3.5. Eh, voilà, 100%
> cpu and no results so far. Nothing has changed at all since 2007.
>
>>>> import re
>>>> re.match(r'a?'*30 + r'a'*30, 'a'*30)
> .... (still waiting)
>
> Quoting from the article: " Some might argue that this test is unfair to the
> backtracking implementations, since it focuses on an uncommon corner case.
> This argument misses the point: given a choice between an implementation
> with a predictable, consistent, fast running time on all inputs or one that
> usually runs quickly but can take years of CPU time (or more) on some
> inputs, the decision should be easy."
>
> Victor, as the head of Python performance department, and Matthew, as the
> maintainer of the new regex module, what is your stance on this particular
> issue?
>
> From my perspective, I can say, that regular expressions might worth
> optimizing especially for web applications (url matching usually uses
> regexes) but also for other applications where I've seen many tight loops
> using regexes as well. So, I am probing interest on this topic here.
>
> Best,
> Sven
>

Hi,
I can't speak about the details of mrab's implementation, but using
regex, I get the resulting match instantly:

Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 07:18:10) [MSC v.1900
32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import regex
>>> regex.match(r'a?'*30 + r'a'*30, 'a'*30)
<regex.Match object; span=(0, 30), match='aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'>
>>>

(I personally prefer to use regex for other advantages, than this
artificial case, but it certainly doesn't hurt to have better
performance here too:)

regards,
    vbr


More information about the Python-Dev mailing list