[Python-Dev] interesting article on regex performance

Sat Mar 13 14:59:57 CET 2010

Collin Winter wrote:
> On Fri, Mar 12, 2010 at 8:12 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
>> There are major practical problems associated with making such a leap
>> directly (Google's re2 engine is in C++ rather than C and we'd have to
>> keep the existing implementation around regardless to handle the
>> features that re2 doesn't support).
> 
> I don't see why C++ would be a deal-breaker in this case, since it
> would be restricted to an extension module.

It wouldn't be a dealbreaker for an optional regex acceleration module,
I agree. It's still a practical problem in the sense that C++ code is
very much in the minority in the CPython code base.

>> I would say it is better to let re2 bake for a while and see if anyone
>> is motivated to come up with Python bindings for it and release them on
>> PyPI.
> 
> FWIW, re2 is heavily, heavily used in production at Google.
> Stabilizing any proposed Python bindings would be a good idea, but I'm
> not sure how much more "baking" re2's core functionality needs.

It was more the bindings side of things I was referring to - the
algorithm's performance is no doubt well established, but how easy it is
for non-Google users to pick up and integrate with other software is
still to be determined.

> We considered such a hybrid approach for Unladen Swallow, but rejected
> it on the advice of the V8 team [1]: you end up maintaining two
> totally separate, incompatible regex engines; the hybrid system comes
> with interesting, possibly unintuitive performance/correctness issues
> when bailing from one implementation to another; performance is
> unstable as small changes are made to the regexes; and it's not
> obvious that enough Python regexes would benefit from re2's
> performance/restrictions tradeoff to make such a hybrid system
> worthwhile in the long term. (There is no representative corpus of
> real-world Python regexes weighted for dynamic execution frequency to
> use in assessing such tradeoffs empirically like there is for
> JavaScript.)
> 
> re2 is very useful when you want to run user-provided regexes and want
> to protect your backends against pathological/malicious regex input,
> but I'm not sure how applicable it is to Python. I think there are
> more promising strategies to improve regex performance, such as
> reusing the new JIT infrastructure to JIT-compile regular expressions
> to machine code (along the lines of V8's irregexp). Some work has
> already been done in this direction, and I'd be thrilled to mentor any
> GSoC students interested in working on such a project this summer.
> 
> Lastly, anyone interested in working on Python regex performance
> should take a look at the regex benchmarks in the standard benchmark
> suite [2].

This is the kind of baking that I was really talking about - there's an
awful lot of investigation to be done before moving Python in the
direction of an NFA based regex engine could be seriously considered.

Good to hear the perspective of someone that has already looked into
this in some detail though.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------