[Python-Dev] interesting article on regex performance

Tres Seaver tseaver at palladion.com
Fri Mar 12 18:02:46 CET 2010


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Nick Coghlan wrote:
> Peter Portante wrote:
>> http://code.google.com/p/re2/
>>
>> On 3/11/10 8:52 PM, "Neal Becker" <ndbecker2 at gmail.com> wrote:
>>
>>> http://swtch.com/~rsc/regexp/regexp1.html
> 
> Both interesting links. I'll add another one to the list:
> http://google-opensource.blogspot.com/2010/03/re2-principled-approach-to-regular.html
> 
> To bring this on-topic for python-dev by considering how it could apply
> to Python's default re engine, I think the key issue is that any updates
> to the default engine would need to remain backwards compatible with all
> of the tricks that re2 doesn't support.
> 
> There are major practical problems associated with making such a leap
> directly (Google's re2 engine is in C++ rather than C and we'd have to
> keep the existing implementation around regardless to handle the
> features that re2 doesn't support).
> 
> I would say it is better to let re2 bake for a while and see if anyone
> is motivated to come up with Python bindings for it and release them on
> PyPI. Once that happens (and assuming the bindings earn a good
> reputation), the first step towards integration would be to include a
> "See Also" in the re module documentation to point people towards the
> more limited (but more robust) regex engine implementation.
> 
> The next step would probably be a hybrid third party library that
> exploits the NFA approach when it can, but resorts to backtracking when
> it has to in order to handle full regex functionality. (Although
> developers would need to be able to disable the backtracking support in
> order to restore re2's guarantees of linear time execution)
> 
> Only once such a hybrid library had been proven in practice could the
> default re module be considered for possible update.
> 
> Basically, I see some interesting ideas there, but I don't see anything
> that will impact the Python core or standard library in the short term.

Agreed on the whole.  However, it might be interesting to explore the
"hybrid mode" using some of the older, C-based libraries referenced by
the original (Russ Cox) article, e.g. Rob Pike's Unicode-aware 1992
implmentation, as ported for Unix:

  http://swtch.com/plan9port/unix/


Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkuac7AACgkQ+gerLs4ltQ6t3wCgxQPsnpbyE1N9BEGg7nHqcBVP
q04AnA6RMJw83+uc5u6Oyud89SXEoP5C
=+0aL
-----END PGP SIGNATURE-----



More information about the Python-Dev mailing list