[Python-Dev] interesting article on regex performance

Nick Coghlan ncoghlan at gmail.com
Fri Mar 12 17:12:23 CET 2010


Peter Portante wrote:
> http://code.google.com/p/re2/
> 
> On 3/11/10 8:52 PM, "Neal Becker" <ndbecker2 at gmail.com> wrote:
> 
>> http://swtch.com/~rsc/regexp/regexp1.html

Both interesting links. I'll add another one to the list:
http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html

To bring this on-topic for python-dev by considering how it could apply
to Python's default re engine, I think the key issue is that any updates
to the default engine would need to remain backwards compatible with all
of the tricks that re2 doesn't support.

There are major practical problems associated with making such a leap
directly (Google's re2 engine is in C++ rather than C and we'd have to
keep the existing implementation around regardless to handle the
features that re2 doesn't support).

I would say it is better to let re2 bake for a while and see if anyone
is motivated to come up with Python bindings for it and release them on
PyPI. Once that happens (and assuming the bindings earn a good
reputation), the first step towards integration would be to include a
"See Also" in the re module documentation to point people towards the
more limited (but more robust) regex engine implementation.

The next step would probably be a hybrid third party library that
exploits the NFA approach when it can, but resorts to backtracking when
it has to in order to handle full regex functionality. (Although
developers would need to be able to disable the backtracking support in
order to restore re2's guarantees of linear time execution)

Only once such a hybrid library had been proven in practice could the
default re module be considered for possible update.

Basically, I see some interesting ideas there, but I don't see anything
that will impact the Python core or standard library in the short term.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
---------------------------------------------------------------


More information about the Python-Dev mailing list