[Python-Dev] interesting article on regex performance

Fri Mar 12 20:08:44 CET 2010

On Fri, Mar 12, 2010 at 8:12 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
[snip]
> To bring this on-topic for python-dev by considering how it could apply
> to Python's default re engine, I think the key issue is that any updates
> to the default engine would need to remain backwards compatible with all
> of the tricks that re2 doesn't support.
>
> There are major practical problems associated with making such a leap
> directly (Google's re2 engine is in C++ rather than C and we'd have to
> keep the existing implementation around regardless to handle the
> features that re2 doesn't support).

I don't see why C++ would be a deal-breaker in this case, since it
would be restricted to an extension module.

> I would say it is better to let re2 bake for a while and see if anyone
> is motivated to come up with Python bindings for it and release them on
> PyPI.

FWIW, re2 is heavily, heavily used in production at Google.
Stabilizing any proposed Python bindings would be a good idea, but I'm
not sure how much more "baking" re2's core functionality needs.

> Once that happens (and assuming the bindings earn a good
> reputation), the first step towards integration would be to include a
> "See Also" in the re module documentation to point people towards the
> more limited (but more robust) regex engine implementation.
>
> The next step would probably be a hybrid third party library that
> exploits the NFA approach when it can, but resorts to backtracking when
> it has to in order to handle full regex functionality. (Although
> developers would need to be able to disable the backtracking support in
> order to restore re2's guarantees of linear time execution)

We considered such a hybrid approach for Unladen Swallow, but rejected
it on the advice of the V8 team [1]: you end up maintaining two
totally separate, incompatible regex engines; the hybrid system comes
with interesting, possibly unintuitive performance/correctness issues
when bailing from one implementation to another; performance is
unstable as small changes are made to the regexes; and it's not
obvious that enough Python regexes would benefit from re2's
performance/restrictions tradeoff to make such a hybrid system
worthwhile in the long term. (There is no representative corpus of
real-world Python regexes weighted for dynamic execution frequency to
use in assessing such tradeoffs empirically like there is for
JavaScript.)

re2 is very useful when you want to run user-provided regexes and want
to protect your backends against pathological/malicious regex input,
but I'm not sure how applicable it is to Python. I think there are
more promising strategies to improve regex performance, such as
reusing the new JIT infrastructure to JIT-compile regular expressions
to machine code (along the lines of V8's irregexp). Some work has
already been done in this direction, and I'd be thrilled to mentor any
GSoC students interested in working on such a project this summer.

Lastly, anyone interested in working on Python regex performance
should take a look at the regex benchmarks in the standard benchmark
suite [2].

Thanks,
Collin Winter

[1] - http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html#c4843826268005492354
[2] - http://hg.python.org/benchmarks/file/5b8fe389710b/performance