[Python-Dev] Should we move to replace re with regex?

Guido van Rossum guido at python.org
Sun Aug 28 05:54:13 CEST 2011


On Sat, Aug 27, 2011 at 5:48 PM, Terry Reedy <tjreedy at udel.edu> wrote:
> Many of the things regex does differently might be called either bug fixes
> or feature changes, depending on one's viewpoint. Regex should definitely
> not be 'bug-compatible'.

Well, as you said, it depends on one's viewpoint. If there's a bug in
the treatment of non-BMP character ranges, that's a bug, and fixing it
shouldn't break anybody's code (unless it was worth breaking :-). But
if there's a change that e.g. (hypothetical example) makes a different
choice about how empty matches are treated in some edge case, and the
old behavior was properly documented, that's a feature change, and I'd
rather introduce a flag to select the new behavior (or, if we have to,
a flag to preserve the old behavior, if the new behavior is really
considered much better and much more useful).
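
To make the two categories concrete, here is a minimal sketch using only the stdlib re module. The sample strings are made up for illustration, and it assumes Python 3.4 or later, where strings are indexed by code point and fullmatch() exists. The first half checks a non-BMP character range (the "plain bug" kind of case). The second half shows one existing empty-match behavior, which is the sort of edge case where a different choice would count as a feature change.

```python
import re

# A character class spanning the supplementary planes (everything above the BMP).
non_bmp = re.compile('[\U00010000-\U0010FFFF]')

# U+1F600 lies outside the BMP; a correct engine treats it as one character.
astral_match = non_bmp.fullmatch('\U0001F600')

# A BMP character such as 'a' must not fall inside that range.
bmp_match = non_bmp.fullmatch('a')

# Empty matches: 'x*' can match the empty string at each position of 'ab',
# including the position at the very end of the string.
empty_matches = re.findall('x*', 'ab')

print(astral_match is not None, bmp_match is None, empty_matches)
```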

> I think regex should be unicode-standard compliant as much as possible, and
> let the chips fall where they may.

In most cases the Unicode improvements in regex are not where it is
incompatible; e.g. adding \X and named ranges are fine new additions
and IIUC the syntax was carefully designed not to introduce any
incompatibilities (within the limitations of \-escapes).

It's the many other "improvements" to the regex module that sometimes
make it incompatible. There's a comprehensive list here:
http://pypi.python.org/pypi/regex . Somebody should just go over it
and for each difference make a recommendation for whether to treat
this as a bugfix, a compatible new feature, or an incompatibility that
requires some kind of flag. (We could have a single flag for all
incompatibilities, or several flags.)

> If so, it would be like the decimal
> module, which closely tracks the IEEE decimal standard, rather than the
> binary float standard.

Well, I would hope that for each "major" Python version (i.e. 3.2,
3.3, 3.4, ...) we would pick a specific version of the Unicode
standard and declare our desire to be compliant with that Unicode
standard version, and not switch allegiances in some bugfix version
(e.g. 3.2.3, 3.3.1, ...).
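
As a rough illustration of what "a specific version of the Unicode standard" means in practice: the stdlib unicodedata module reports the version of the Unicode Character Database it was built with. The variable names below are just for this sketch, and it says nothing about which Unicode version re or regex targets.

```python
import unicodedata

# Version string of the Unicode Character Database bundled with this interpreter.
ucd_version = unicodedata.unidata_version

# Split the dotted version string into integers so versions can be compared.
ucd_tuple = tuple(int(part) for part in ucd_version.split('.'))

print(ucd_version, ucd_tuple)
```

Under the policy described above, this value would stay fixed across all bugfix releases of a given 3.x series.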

> Regex is already much more compliant than re, as shown by Tom Christiansen.

Nobody disagrees with this or thinks it's a bad thing. :-)

> This is pretty obviously intentional on MB's part.

That's also clear.

> It is also probably intentional that re *not* match today's Unicode
> TR18 specifications.

That I'm not so sure of. I think it's more the case that TR18 evolved
and that the re module didn't -- probably mostly because nobody had
the time and nobody was aware of the TR18 changes.

> These are reasons why both Ezio and I suggested on the tracker adding regex
> without deleting re. (I personally would not mind just replacing re with
> regex, but then I have no legacy re code to break. So I am not suggesting
> that out of respect for those who do.)

That option is definitely still on the table. At the very least a
thorough review of the stated differences between re and regex should
be done -- I trust that MR has been very thorough in his listing of
those differences. The issues regarding maintenance and stability of
MR's code can be solved in a number of ways -- if MR doesn't mind I
would certainly be willing to give him core committer access (though
I'd still recommend that he use his time primarily to train others in
maintaining this important code base).

-- 
--Guido van Rossum (python.org/~guido)

