[Python-Dev] Should we move to replace re with regex?

M.-A. Lemburg mal at egenix.com
Sat Aug 27 01:00:31 CEST 2011


Guido van Rossum wrote:
> On Fri, Aug 26, 2011 at 3:09 PM, M.-A. Lemburg <mal at egenix.com> wrote:
>> Guido van Rossum wrote:
>>> I just made a pass of all the Unicode-related bugs filed by Tom
>>> Christiansen, and found that in several, the response was "this is
>>> fixed in the regex module [by Matthew Barnett]". I started replying
>>> that I thought that we should fix the bugs in the re module (i.e.,
>>> really in _sre.c) but on second thought I wonder if maybe regex is
>>> mature enough to replace re in Python 3.3. It would mean that we won't
>>> fix any of these bugs in earlier Python versions, but I could live
>>> with that.
>>>
>>> However, I don't know much about regex -- how compatible is it, how
>>> fast is it (including extreme cases where the backtracking goes
>>> crazy), how bug-free is it, and so on. Plus, how much work would it be
>>> to actually incorporate it into CPython as a complete drop-in
>>> replacement of the re package (such that nobody needs to change their
>>> imports or the flags they pass to the re module).
>>>
>>> We'd also probably have to train some core developers to be familiar
>>> enough with the code to maintain and evolve it -- I assume we can't
>>> just volunteer Matthew to do so forever... :-)
>>>
>>> What's the alternative? Is adding the requested bug fixes and new
>>> features to _sre.c really that hard?
>>
>> Why not simply add the new lib, see whether it works out, and
>> then decide which path to follow?
>>
>> We've done that with the old regex lib. It took a few years
>> and releases to have people port their applications to the
>> then new re module and syntax, but in the end it worked.
>>
>> With a new regex library there are likely going to be quite
>> a few subtle differences between re and regex - even if it's
>> just doing things in a more Unicode compatible way.
>>
>> I don't think anyone can actually list all the differences given
>> the complex nature of regular expressions, so people will
>> likely need a few years and releases to get used to it before
>> a switch can be made.
> 
> I can't say I liked how that transition was handled last time around.
> I really don't want to have to tell people "Oh, that bug is fixed but
> you have to use regex instead of re" and then a few years later have
> to tell them "Oh, we're deprecating regex, you should just use re".

No, you tell them: "If you want Unicode 6 semantics, use regex;
if you're fine with Unicode 2.0/3.0 semantics, use re". After all,
it's not like re suddenly stopped working :-)
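
As a rough illustration of what that choice could look like in
application code, here is a minimal sketch. It assumes regex is
installed as a separate third-party module and stays API-compatible
with re for the calls used; the alias rx and the helper find_words
are made up for this example.

```python
# Hypothetical sketch: prefer the third-party regex module when it is
# installed, otherwise fall back to the standard library's re module.
try:
    import regex as rx  # assumption: installed separately
except ImportError:
    import re as rx  # always available in the standard library


def find_words(text):
    """Return the runs of word characters found in text."""
    # \w is Unicode-aware for str patterns in Python 3 in both modules.
    return rx.findall(r"\w+", text)


# Report which module was picked up and show one simple match.
print(rx.__name__, find_words("naïve café"))
```

Code written this way keeps working on re alone, so nothing breaks
for people who never install the new lib.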

> I'm really hoping someone has more actual technical understanding of
> re vs. regex and can give us some facts about the differences, rather
> than, frankly, FUD.

The good part is that it's based on the re code. The FUD comes
from the fact that the new lib is 380kB larger than the old one,
and that's not even counting the generated 500kB of lookup
tables.

If no one steps up to do a review or analysis, I think the
only practical way to test the lib is to give it a prominent
chance to prove itself.

The other aspect is maintenance.

Perhaps we could have a Summer of Code student do a review and
analysis to get familiar with the code and then have at least
two developers know the code well enough to support it for
a while.

-- 
Marc-Andre Lemburg
eGenix.com


