[Python-Dev] what is happening with the regex module going into Python 3.3?

Nick Coghlan ncoghlan at gmail.com
Sat Jun 2 07:14:02 CEST 2012


On Sat, Jun 2, 2012 at 10:37 AM, Mark Lawrence <breamoreboy at yahoo.co.uk> wrote:
> On 01/06/2012 18:27, Brett Cannon wrote:
>>
>> About the only thing I can think of from the language summit that we
>> discussed doing for Python 3.3 that has not come about is accepting the
>> regex module and getting it into the stdlib. Is this still being worked
>> towards?
>>
>
> Umpteen versions of regex have been available on pypi for years. Umpteen
> bugs against the original re module have been fixed.  If regex can't now go
> into the standard library, what on earth can?

That's why it's approved *in principle* already. However, it's not a
simple matter of dropping something into the standard library and
calling it done, especially an extension module as complex as regex.
Even integrating a simple pure Python module like ipaddr took
substantial effort:

1. The API had to be reviewed to see if it was suitable for someone
who was *not* familiar with the problem domain, but was instead
learning about it from the standard library documentation. This isn't
a big concern for regex, since it is replacing the existing re module,
but this is the main reason ipaddr became ipaddress before PEP 3144
was approved (the ipaddr API plays fast and loose with network
terminology in a way that someone who already *knows* that
terminology can easily grasp, but that would have been incredibly
confusing to someone discovering those terms for the first time).

2. The code had to actually be added to the standard library (not a
big effort for PEP 3144 - saving ipaddress.py into Lib/ and
test_ipaddress.py into Lib/test/ pretty much covered it)

3. Redundant 2.x cruft needed to be weeded out (ongoing)

4. The howto guide needed to be incorporated into the documentation
(and rewritten to be more suitable for genuine beginners)

5. An API module reference still needs to be incorporated into the
standard library reference

The effort to integrate regex is going to be substantially higher,
since it's a significantly more complicated module:

1. A new, non-trivial C extension needs to be incorporated into both
the autotools and Windows build processes
2. Due to PEP 393, there's a major change to the string implementation
in 3.3. Does regex still build against that? Even if it builds, it
should probably be ported to the new API for performance reasons.
3. Does regex build cleanly on all platforms supported by CPython? If
not, do we need to keep the existing re module around as a fallback
mechanism?
4. How do we merge the test suites? Do we keep the existing test
suite, add the regex test suite, then filter for duplication
afterwards?
5. What, precisely, *are* the backwards incompatibilities between
regex and re? Does the standard library trigger any of them? Does the
test suite?
6. How will the PyPI backport be maintained in the future? The amount
of backwards compatibility cruft in standard library code should be
minimised, but that potentially makes backports more difficult.
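
As a rough illustration of how question 5 might be approached, the
sketch below runs the same pattern/text pairs through two modules and
reports any mismatches. The helper names (load_candidate,
find_differences) and the sample CASES are hypothetical, made up for
this example; it assumes only that the candidate module exposes a
findall() function compatible with re.findall(), and it falls back to
re itself when regex is not installed.

```python
import importlib
import re


def load_candidate(name="regex"):
    """Return the named module, or fall back to re if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return re


def find_differences(cases, baseline=re, candidate=None):
    """Run each (pattern, text) pair through both modules; collect mismatches."""
    if candidate is None:
        candidate = load_candidate()
    differences = []
    for pattern, text in cases:
        results = []
        for module in (baseline, candidate):
            try:
                results.append(module.findall(pattern, text))
            except Exception as exc:
                # A pattern that one module rejects is also a difference,
                # so record the exception type in place of a result.
                results.append(type(exc).__name__)
        if results[0] != results[1]:
            differences.append((pattern, text, results[0], results[1]))
    return differences


# Hypothetical sample cases; a real survey would draw its patterns from
# the standard library and the existing test suite.
CASES = [
    (r"\w+", "spam and eggs"),
    (r"(a)|b", "ab"),
    (r"\d{2,}", "1 22 333"),
]
```

Calling find_differences(CASES) with regex installed would list every
case where the two modules disagree; an empty list only says that
these particular cases agree, not that the modules are compatible.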

ipaddress is in the 3.3 standard library because Peter Moody cared
enough about the concept to initially submit it for inclusion, and
because I volunteered to drive the review and integration process
forward and to be the final arbiter of what counted as "good enough"
for inclusion. That hasn't happened yet for regex - either nobody has
cared enough to write a PEP for it, or the bystander effect has kicked
in and everyone who cares is assuming *someone else* will take up the
burden of being the PEP champion.

So that's the first step: someone needs to take
http://bugs.python.org/issue2636 and turn it into a PEP (searching the
python-dev and python-ideas archives for references to previous
discussions of the topic would also be good, along with summarising
the open Unicode-related re bugs reported by Tom Christiansen where the
answer is currently "use regex from PyPI instead of the standard
library's re module" [1]).

[1] http://bugs.python.org/issue?%40search_text=&ignore=file%3Acontent&title=&%40columns=title&id=&%40columns=id&stage=&creation=&creator=tchrist&activity=&%40columns=activity&%40sort=activity&actor=&nosy=&type=&components=&versions=&dependencies=&assignee=&keywords=&priority=&%40group=priority&status=1&%40columns=status&resolution=&nosy_count=&message_count=&%40pagesize=50&%40startwith=0&%40queryname=&%40old-queryname=&%40action=search

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

