Should we move to replace re with regex?

I just made a pass of all the Unicode-related bugs filed by Tom Christiansen, and found that in several, the response was "this is fixed in the regex module [by Matthew Barnett]". I started replying that I thought that we should fix the bugs in the re module (i.e., really in _sre.c) but on second thought I wonder if maybe regex is mature enough to replace re in Python 3.3. It would mean that we won't fix any of these bugs in earlier Python versions, but I could live with that. However, I don't know much about regex -- how compatible is it, how fast is it (including extreme cases where the backtracking goes crazy), how bug-free is it, and so on. Plus, how much work would it be to actually incorporate it into CPython as a complete drop-in replacement of the re package (such that nobody needs to change their imports or the flags they pass to the re module). We'd also probably have to train some core developers to be familiar enough with the code to maintain and evolve it -- I assume we can't just volunteer Matthew to do so forever... :-) What's the alternative? Is adding the requested bug fixes and new features to _sre.c really that hard? -- --Guido van Rossum (python.org/~guido)

Guido van Rossum wrote:
Why not simply add the new lib, see whether it works out and then decide which path to follow? We've done that with the old regex lib. It took a few years and releases to have people port their applications to the then new re module and syntax, but in the end it worked. With a new regex library there are likely going to be quite a few subtle differences between re and regex - even if it's just doing things in a more Unicode compatible way. I don't think anyone can actually list all the differences given the complex nature of regular expressions, so people will likely need a few years and releases to get used to it before a switch can be made. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 27 2011)

On Fri, Aug 26, 2011 at 3:09 PM, M.-A. Lemburg <mal@egenix.com> wrote:
I can't say I liked how that transition was handled last time around. I really don't want to have to tell people "Oh, that bug is fixed but you have to use regex instead of re" and then a few years later have to tell them "Oh, we're deprecating regex, you should just use re". I'm really hoping someone has more actual technical understanding of re vs. regex and can give us some facts about the differences, rather than, frankly, FUD. -- --Guido van Rossum (python.org/~guido)

On Fri, 26 Aug 2011 15:18:35 -0700 Guido van Rossum <guido@python.org> wrote:
The best way would be to contact the author, Matthew Barnett, or to ask on the tracker on http://bugs.python.org/issue2636. He has been quite willing to answer such questions in the past, AFAIR. Regards Antoine.

On Fri, Aug 26, 2011 at 3:33 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I had added him to the beginning of this thread but someone took him off.
So, that issue is about something called "regexp". AFAIK Matthew (MRAB) wrote something called "regex" (http://pypi.python.org/pypi/regex). Are they two different things??? -- --Guido van Rossum (python.org/~guido)

On Fri, 26 Aug 2011 15:47:21 -0700 Guido van Rossum <guido@python.org> wrote:
No, it's the same. The source is at https://code.google.com/p/mrab-regex-hg/, btw. Regards Antoine.

Guido van Rossum wrote:
No, you tell them: "If you want Unicode 6 semantics, use regex; if you're fine with Unicode 2.0/3.0 semantics, use re". After all, it's not like re suddenly stopped working :-)
The good part is that it's based on the re code; the FUD comes from the fact that the new lib is 380kB larger than the old one, and that's not even counting the generated 500kB of lookup tables. If no one steps up to do a review or analysis, I think the only practical way to test the lib is to give it a prominent chance to prove itself. The other aspect is maintenance. Perhaps we could have a summer of code student do a review and analysis to get familiar with the code and then have at least two developers know the code well enough to support it for a while. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 27 2011)

On Sat, 27 Aug 2011 01:00:31 +0200 "M.-A. Lemburg" <mal@egenix.com> wrote:
It has a whole lot of new features in addition to better unicode support. See for yourself: https://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails
I'm not sure a GSoC student would be the best candidate to do a review matching our expectations. Regards Antoine.

"M.-A. Lemburg" <mal@egenix.com> writes:
Guido van Rossum wrote:
What do we say, then, to those who are unaware of the different semantics between those versions of Unicode, and want regular expressions to “just work” in Python? To which document can we direct them to understand what semantics they want?
After all, it's not like re suddenly stopped working :-)
For some value of “working”, that is. The trick is to know whether that value is what one wants. -- “The fact of your own existence is the most astonishing fact you'll ever have to confront. Don't dare ever see your life as boring, monotonous, or joyless.” —Richard Dawkins, 2010-03-10 -- Ben Finney

Ben Finney wrote:
"M.-A. Lemburg" <mal@egenix.com> writes:
Presumably, like all modules, both the re and the regex module will have their own individual pages in the library reference. As the newcomer, regex should include a discussion of differences between the two. This can then be quietly dropped once re becomes formally deprecated. (Assuming that the std lib keeps re and regex in parallel for a few releases, which is not a given.) However, I note that last time, the old regex module was just documented as obsolete with little detailed discussion of the differences: http://docs.python.org/release/1.5/lib/node69.html#SECTION005300000000000000... -- Steven

Steven D'Aprano <steve@pearwood.info> writes:
My question is directed more to M-A Lemburg's passage above, and its implicit assumption that the user understands the changes between “Unicode 2.0/3.0 semantics” and “Unicode 6 semantics”, and how their own needs relate to those semantics. For programmers who know they want to follow Unicode conventions in Python, but don't know the distinction M-A Lemburg is drawing, to which document does he recommend we direct them? “The Unicode specification document in its various versions” isn't a feasible answer. -- “Computers are useless. They can only give you answers.” —Pablo Picasso -- Ben Finney

Ben Finney wrote:
I can only repeat my answer: the docs for the new regex module should include a discussion of the differences. If that requires summarising the differences that M-A Lemburg refers to, then so be it.
“The Unicode specification document in its various versions” isn't a feasible answer.
Presumably the Unicode spec will be the canonical source, but I agree that we should not expect people to read that in order to make a decision between re and regex. -- Steven

On Fri, Aug 26, 2011 at 2:45 PM, Guido van Rossum <guido@python.org> wrote:
...but on second thought I wonder if maybe regex is mature enough to replace re in Python 3.3.
I agree that the move from regex to re was kind of painful. It seems someone should merge the unit tests for re and regex, and apply the merged result to each for the sake of comparison. There might also be a need to expand the merged result to include new things. Then there probably should be a from __future__ import for a while.
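[A rough sketch of the kind of side-by-side check a merged test corpus would make possible, assuming regex is installed from PyPI; the sample cases here are illustrative placeholders, not entries from either module's real test suite:]

    # Hypothetical comparison harness: run the same pattern/text pairs through
    # both engines and flag any disagreement.  Requires "pip install regex";
    # the cases below are placeholders, not real test-suite entries.
    import re
    import regex

    CASES = [
        (r"\w+", "foo bar"),          # word characters
        (r"[0-9]+", "release 3.3"),   # digits
        (r"a|ab", "ab"),              # alternation, leftmost-match behaviour
    ]

    for pattern, text in CASES:
        with_re = [m.group() for m in re.finditer(pattern, text)]
        with_regex = [m.group() for m in regex.finditer(pattern, text)]
        verdict = "same" if with_re == with_regex else "DIFFERS"
        print("{0!r} on {1!r}: re={2} regex={3} -> {4}".format(
            pattern, text, with_re, with_regex, verdict))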

On Fri, 26 Aug 2011 15:48:42 -0700 Dan Stromberg <drsalists@gmail.com> wrote:
Then there probably should be a from __future__ import for a while.
If you are willing to use a "from __future__ import", why not simply import regex as re ? We're not Perl, we don't have built-in syntactic support for regular expressions. Regards Antoine.
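[A minimal sketch of what that looks like in practice -- no __future__ machinery, just an import alias, assuming the third-party regex package is installed:]

    # Opt in to the new engine on a per-module basis by shadowing the name.
    # Code written against the re-compatible API keeps working unchanged.
    import regex as re

    pattern = re.compile(r"(?P<word>\w+)")
    match = pattern.search("hello world")
    print(match.group("word"))  # -> hello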

On Fri, Aug 26, 2011 at 5:08 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
If you add regex as "import regex", and the new regex module doesn't work out, regex might be harder to get rid of. from __future__ import is an established way of trying something for a while to see if it's going to work. E.g.: "from __future__ import re", where re is really the new module. But whatever.

On Fri, 26 Aug 2011 17:25:56 -0700 Dan Stromberg <drsalists@gmail.com> wrote:
That's an interesting idea. This way, integrating the new module would be a less risky move, since if it gives us too many problems, we could back out our decision in the next feature release. Regards Antoine.

Antoine Pitrou wrote:
I'm not sure that's correct. If there are differences in either the interface or the behaviour between the new regex and re, then reverting will be a pain regardless of whether you have:

    from __future__ import re
    re.compile(...)

or:

    import regex
    regex.compile(...)

Either way, if the new regex library goes away, code will break, and fixing it may not be easy. It's not likely to be so easy that merely deleting the "from __future__ ..." line will do it, but if it is that easy, then using "import re as regex" will be just as easy. Have there been any __future__ features that were added provisionally? I can't think of any. That's not what __future__ is for, at least according to PEP 236: http://www.python.org/dev/peps/pep-0236/ I can't think of any __future__ feature that could be easily reverted once people start relying on it. Either syntax would break, or behaviour would change. The PEP even explicitly states that __future__ should not be used for changes which are backward compatible: "Note that there is no need to involve the future_statement machinery in new features unless they can break existing code; fully backward-compatible additions can-- and should --be introduced without a corresponding future_statement." I wasn't around for the move from 1.4 regex to 1.5 re, so I don't know what was done poorly last time. But I can't see why we should treat regular expressions so differently from (say) argparse and optparse. "from __future__ import optparse"? No. Just... no. -- Steven

On Fri, Aug 26, 2011 at 8:47 PM, Steven D'Aprano <steve@pearwood.info>wrote:
You're talking technically, which is important, but that wasn't what I was suggesting would be helped. Politically, and from a marketing standpoint, it's easier to withdraw a feature you've given with a "Play with this, see if it works for you" warning. Have there been any __future__ features that were added provisionally?
I can't either, but ISTR hearing that from __future__ import was started with such an intent. Regardless, it's hard to import something from "future" without at least suspecting that you're on the bleeding edge.

I can't either, but ISTR hearing that from __future__ import was started with such an intent.
No, not at all. The original intention was to enable features that would definitely be added, just not right now. Tim Peters always objected to claims that future imports were talking about provisional features.
We don't want to add features to Python that we may have to withdraw. If there is doubt whether they should be added, they shouldn't be added. If they do get added, we have to live with it (until, say, Python 4, where bad features can be removed again). Regards, Martin

On Sat, Aug 27, 2011 at 4:01 PM, Dan Stromberg <drsalists@gmail.com> wrote:
The standard library isn't for playing. "pip install regex" is for playing. If we aren't sure we want to make the transition, then it doesn't go in. However, to my mind, reviewing and incorporating regex is a far more feasible model than trying to enhance the existing re module with a comparable feature set. At the moment, there's already an obvious way to get enhanced regex support in Python: install regex and use it instead of the standard library's re module. That's enough to pretty much kill any motivation anyone might have to make major changes to re itself. We're at least getting one thing right this time that we got wrong with multiprocessing, though - we're much, much further out from the 3.3 release than we were from the 2.6 release when multiprocessing was added to the standard library :) The next step needed is for someone to volunteer to write and champion a PEP that:
- articulates the deficiencies in the current re module (the regex docs already cover some of this, as do Tom Christiansen's notes on the issue tracker)
- explains why upgrading re in place is not feasible (e.g. noting that the availability of regex really limits the desire for anyone to reinvent that particular wheel, so even things that are theoretically possible may be highly unlikely in practice)
- proposes a transition plan (personally, I'd be fine with an optparse -> argparse style transition where re remains around indefinitely to support legacy code, but new users are pointed towards regex. But depending on compatibility details, merging the two APIs in the existing re namespace may also be feasible)
- proposes a maintenance strategy (I don't know how much Matthew has written regarding internal design details, but that kind of thing could really help. Matthew agreeing to continue maintenance as part of the standard library would also help a great deal, but wouldn't be enough on its own - while it's good for modules to have active maintainers to make the final call on associated design decisions, it's potentially problematic when other core developers don't understand what the code is doing well enough to fix bugs in it)
- confirms that the regex test suite can be incorporated cleanly into the standard library regression test suite (the difficulty of this was something that was underestimated for the inclusion of multiprocessing. Test suite integration is also the final sticking point holding up the PEP 380 'yield from' patch, although that's close to being resolved following the PyConAU sprints)
- documents the tests conducted (e.g. micro-benchmark results, fusil results)
PEP 371 (addition of multiprocessing), PEP 389 (addition of argparse) and Jesse's reflections on the way multiprocessing was added (http://jessenoller.com/2009/01/28/multiprocessing-in-hindsight/) are well worth reading for anyone considering stepping up to write a PEP. That last also highlights why even Matthew's support, however capably he has handled maintenance of regex as an independent project, wouldn't be enough - we had Richard Oudkerk's support and agreement to continue maintenance as the original author of multiprocessing, but he became unavailable early in the integration process. If Jesse hadn't been able to take up most of that slack, the likely result would have been reversion of the changes and removal of multiprocessing from the 2.6 release.
Writing PEPs can be quite a frustrating experience (since a lot of feedback will be negative as people try to poke holes in the idea to see if it stands up to close scrutiny), but it's also really satisfying and rewarding if they end up getting accepted and incorporated :)
No, we make an explicit guarantee that future imports will never go away once they've been added. They may become redundant, but they won't break. There's no provision in the future mechanism for changes that are added and then later removed (see http://docs.python.org/dev/library/__future__). They're strictly for cases where backwards incompatibilities (usually, but not always, new keywords) may break existing code. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan wrote:
The next step needed is for someone to volunteer to write and champion a PEP that:
Would it be feasible and desirable to modify regex so that it *is* backwards-compatible with re, with a view to making it a drop-in replacement at some point? If not, the PEP should discuss this also. -- Greg

On 8/27/2011 7:39 PM, Greg Ewing wrote:
Many of the things regex does differently might be called either bug fixes or feature changes, depending on one's viewpoint. Regex should definitely not be 'bug-compatible'. I think regex should be unicode-standard compliant as much as possible, and let the chips fall where they may. If so, it would be like the decimal module, which closely tracks the IEEE decimal standard, rather than the binary float standard. Regex is already much more compliant than re, as shown by Tom Christiansen. This is pretty obviously intentional on MB's part. It is also probably intentional that re *not* match today's Unicode TR18 specifications. These are reasons why both Ezio and I suggested on the tracker adding regex without deleting re. (I personally would not mind just replacing re with regex, but then I have no legacy re code to break. So I am not suggesting that out of respect for those who do.) -- Terry Jan Reedy

On Sun, Aug 28, 2011 at 3:48 AM, Terry Reedy <tjreedy@udel.edu> wrote:
I would actually prefer to replace re. Before doing that we should make a list of all the differences between the two modules (possibly in the PEP). On the regex page on PyPI there's already a list that can be used for this purpose [0]. For bug fixes it *shouldn't* be a problem if the behavior changes. New features shouldn't bring any backward-incompatible behavioral changes, and, as far as I understand, Matthew introduced the NEW flag [1], to avoid problems when they do. I think re should be kept around only if there are too many incompatibilities left and if they can't be fixed in regex. Best Regards, Ezio Melotti [0]: http://pypi.python.org/pypi/regex/0.1.20110717 [1]: "The NEW flag turns on the new behaviour of this module, which can differ from that of the 're' module, such as splitting on zero-width matches, inline flags affecting only what follows, and being able to turn inline flags off."
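[For concreteness, a small sketch of the kind of switch the NEW flag controls, based on the description quoted above and assuming a 2011-era regex release where the flag is spelled NEW (later releases renamed it VERSION1):]

    import regex

    text = "one,,two"

    # Default behaviour stays close to re: a pattern that can match the
    # empty string does not split at zero-width matches.
    print(regex.compile(r",*").split(text))

    # With the NEW flag, zero-width matches do split, inline flags only
    # affect what follows them, and inline flags can be turned off again.
    print(regex.compile(r",*", regex.NEW).split(text))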

On Sat, Aug 27, 2011 at 5:48 PM, Terry Reedy <tjreedy@udel.edu> wrote:
Well, as you said, it depends on one's viewpoint. If there's a bug in the treatment of non-BMP character ranges, that's a bug, and fixing it shouldn't break anybody's code (unless it was worth breaking :-). But if there's a change that e.g. (hypothetical example) makes a different choice about how empty matches are treated in some edge case, and the old behavior was properly documented, that's a feature change, and I'd rather introduce a flag to select the new behavior (or, if we have to, a flag to preserve the old behavior, if the new behavior is really considered much better and much more useful).
I think regex should be unicode-standard compliant as much as possible, and let the chips fall where they may.
In most cases the Unicode improvements in regex are not where it is incompatible; e.g. adding \X and named ranges are fine new additions and IIUC the syntax was carefully designed not to introduce any incompatibilities (within the limitations of \-escapes). It's the many other "improvements" to the regex module that sometimes make it incompatible. There's a comprehensive list here: http://pypi.python.org/pypi/regex. Somebody should just go over it and for each difference make a recommendation for whether to treat this as a bugfix, a compatible new feature, or an incompatibility that requires some kind of flag. (We could have a single flag for all incompatibilities, or several flags.)
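[As a quick illustration of the kind of compatible addition meant here, \X in regex matches a whole grapheme cluster where "." only matches a single code point -- a sketch assuming the regex package is installed:]

    import regex

    # U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE ACCENT:
    # one user-perceived character, two code points.
    s = "e\u0301"

    print(regex.findall(r".", s))   # two separate code points
    print(regex.findall(r"\X", s))  # one grapheme cluster, kept together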
Well, I would hope that for each "major" Python version (i.e. 3.2, 3.3, 3.4, ...) we would pick a specific version of the Unicode standard and declare our desire to be compliant with that Unicode standard version, and not switch allegiances in some bugfix version (e.g. 3.2.3, 3.3.1, ...).
Regex is already much more compliant than re, as shown by Tom Christiansen.
Nobody disagrees with this or thinks it's a bad thing. :-)
This is pretty obviously intentional on MB's part.
That's also clear.
It is also probably intentional that re *not* match today's Unicode TR18 specifications.
That I'm not so sure of. I think it's more the case that TR18 evolved and that the re module didn't -- probably mostly because nobody had the time and nobody was aware of the TR18 changes.
That option is definitely still on the table. At the very least a thorough review of the stated differences between re and regex should be done -- I trust that MR has been very thorough in his listing of those differences. The issues regarding maintenance and stability of MR's code can be solved in a number of ways -- if MR doesn't mind I would certainly be willing to give him core committer access (though I'd still recommend that he use his time primarily to train others in maintaining this important code base). -- --Guido van Rossum (python.org/~guido)

On 8/27/2011 11:54 PM, Guido van Rossum wrote:
Definitely. The unicode version would have to be frozen with beta 1 if not before. (I am quite sure the decimal module also freezes the IEEE standard version *it* follows for each Python version.) In my view, x.y is a version of the Python language while the x.y.z CPython releases are progressively better implementations of that one language, starting with x.y.0. This is the main reason I suggested that the first CPython release for the 3.3 language be called 3.3.0, as it now is. In this view, there is no question of an x.y.z+1 release changing the definition of the x.y language. -- Terry Jan Reedy

On Fri, Aug 26, 2011 at 11:01 PM, Dan Stromberg <drsalists@gmail.com> wrote: [Steven]
No, this was not the intent of __future__. The intent is that a feature is desirable but also backwards incompatible (e.g. introduces a new keyword) so that for 1 (sometimes more) releases we require the users to use the __future__ import. There was never any intent to use __future__ for experimental features. If we want that maybe we could have from __experimental__ import <whatever>. -- --Guido van Rossum (python.org/~guido)

On Sat, Aug 27, 2011 at 9:53 AM, Brian Curtin <brian.curtin@gmail.com>wrote:
I disagree. The first paragraph says this has something to do with new keywords. It doesn't appear to say what we expect users to -do- with it. Both are important. Is it "You'd better try this, because it's going in eventually. If you don't try it out before it becomes default behavior, you have no right to complain"? And if people do complain, what are python-dev's options?

On 2011-08-27, at 2:20 PM, Dan Stromberg wrote:
__future__ imports have nothing to do with "trying stuff before it comes"; they have to do with backward compatibility. For example, the "with_statement" was a __future__ import because introducing the "with" keyword would break any code using "with" as a token. I don't think that the goal of introducing "with" as a future import was "we're gonna see how it pans out, and decide if we really introduce it later". __future__ means "It's coming, prepare your code".
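[For concreteness, this is what that opt-in looked like in Python 2.5 code; "data.txt" is just a placeholder file name:]

    # Python 2.5: "with" was not yet enabled by default, so a module that
    # wanted to use it had to declare that at the top of the file.
    from __future__ import with_statement

    with open("data.txt") as f:   # "data.txt" is a placeholder
        data = f.read()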

Well, users can use the new features...
No. It's "we have that feature which will be activated in a future version. If you want to use it today, use the __future__ import. If you don't want to use it (now or in the future), just don't."
And if people do complain, what are python-dev's options?
That will depend on the complaint. If it's "I don't like the new feature", then the obvious response is "don't use it, then". Regards, Martin

Dan Stromberg wrote:
Have you read the PEP? I found it very helpful. http://www.python.org/dev/peps/pep-0236/ The motivation given in the first paragraph is pretty clear to me: __future__ is machinery added to Python to aid the transition when a backwards incompatible change is made. Perhaps it needs a note stating explicitly that it is not for trying out new features which may or may not be added at a later date. That may help prevent confusion in the, er, future. [...]
And if people do complain, what are python-dev's options?
The PEP includes a question very similar to that: Q: Going back to the nested_scopes example, what if release 2.2 comes along and I still haven't changed my code? How can I keep the 2.1 behavior then? A: By continuing to use 2.1, and not moving to 2.2 until you do change your code. The purpose of future_statement is to make life easier for people who keep current with the latest release in a timely fashion. We don't hate you if you don't, but your problems are much harder to solve, and somebody with those problems will need to write a PEP addressing them. future_statement is aimed at a different audience. To me, it's quite clear: once a feature change hits __future__, it is already part of the language. It may be an optional part for at least one release, but removing it again will require the same deprecation process as removing any other language feature (see PEP 5 for more details). -- Steven

On Aug 26, 2011, at 05:25 PM, Dan Stromberg wrote:
from __future__ import is an established way of trying something for a while to see if it's going to work.
Actually, no. The documentation says: -----snip snip----- __future__ is a real module, and serves three purposes: * To avoid confusing existing tools that analyze import statements and expect to find the modules they’re importing. * To ensure that future statements run under releases prior to 2.1 at least yield runtime exceptions (the import of __future__ will fail, because there was no module of that name prior to 2.1). * To document when incompatible changes were introduced, and when they will be — or were — made mandatory. This is a form of executable documentation, and can be inspected programmatically via importing __future__ and examining its contents. -----snip snip----- So, really the __future__ module is a way to introduce accepted but incompatible changes in a controlled way, through successive releases. It's never been used to introduce experimental features that might be removed if they don't work out. Cheers, -Barry
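[The "inspected programmatically" part mentioned above looks like this:]

    import __future__

    # Each feature records the release in which it became available via the
    # future statement and the release in which it became (or will become)
    # mandatory.
    for name in __future__.all_feature_names:
        feature = getattr(__future__, name)
        print(name, feature.getOptionalRelease(), feature.getMandatoryRelease())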

However, I don't know much about regex
The problem really is: nobody does (except for Matthew Barnett probably). This means that this contribution might be stuck "forever": somebody would have to review the module, identify issues, approve it, and take the blame if something breaks. That takes considerable time and has a considerable risk, for little expected glory - so nobody has volunteered to mentor/manage integration of that code. I believe most core contributors (who have run into this code) consider it worthwhile, but are just too scared to take action. Among us, some are more "regex gurus" than others; you know who you are. I guess the PSF would pay for the review, if that is what it would take. Regards, Martin

On Sat, Aug 27, 2011 at 1:57 AM, Guido van Rossum <guido@python.org> wrote:
Matthew has always been responsive on the tracker, usually fixing reported bugs in a matter of days, and I think he's willing to keep doing so once the regex module is included. Even though I haven't yet tried the module myself (I'm planning to do it though), it seems quite popular out there (the download number on PyPI apparently gets reset for each new release, so I don't know the exact total), and apparently people are already using it as a replacement for re. I'm not sure it's worth doing an extensive review of the code; a better approach might be to require extensive test coverage (and a review of tests). If the code seems well written, commented, documented (I think proper rst documentation is still missing), and tested (both with unittest and out in the wild), and Matthew is willing to maintain it, I think we can include it. We will get familiar with the code once we start contributing to it and fixing bugs, as already happens with most of the other modules. See also the "New regex module for 3.2?" thread ( http://mail.python.org/pipermail/python-dev/2010-July/101606.html ). Best Regards, Ezio Melotti

On Sat, 27 Aug 2011 04:37:21 +0300 Ezio Melotti <ezio.melotti@gmail.com> wrote:
Isn't this precisely what a review is supposed to assess?
We will get familiar with the code once we start contributing to it and fixing bugs, as it already happens with most of the other modules.
I'm not sure it's a good idea for a module with more than 10000 lines of C code (and 4000 lines of pure Python code). This is several times the size of multiprocessing. The C code looks very cleanly written, but it's still a big chunk of algorithmically sophisticated code. Another "interesting" question is whether it's easy to port to the PEP 393 string representation, if it gets accepted. Regards Antoine.

Am 27.08.2011 08:33, schrieb Terry Reedy:
That's a quality-of-implementation issue (in both cases). In principle, the modules should continue to work unmodified, and indeed SRE does. However, the module will then match on Py_UNICODE, which may be expensive to produce, and may not meet your expectations of surrogate pair handling. So realistically, the module should be ported, which has the challenge that matching needs to operate on three different representations. The modules already support two representations (unsigned char and Py_UNICODE), but probably switching on type, not on state. Regards, Martin

On Sat, 27 Aug 2011 09:18:14 +0200 "Martin v. Löwis" <martin@v.loewis.de> wrote:
From what I've seen, re generates two different sets of functions at compile-time (with a stringlib-like approach), while regex has a run-time flag to choose between the two representations (where, interestingly, the two code paths are spelled out explicitly and are almost duplicates of each other). Matthew, please correct me if I'm wrong. Regards Antoine.

On Sat, Aug 27, 2011 at 4:56 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
This can be done without actually knowing and understanding every single function in the module (I got the impression that someone wants this kind of review, correct me if I'm wrong).
Even unicodeobject.c is 10k+ lines of C code and I got familiar with (parts of) it just by fixing bugs in specific functions. I took a look at the regex code and it seems clear, with enough comments and several small functions that are easy to follow and understand. multiprocessing requires good knowledge of a number of concepts and platform-specific issues that make it more difficult to understand and maintain (but maybe regex-related concepts seem easier to me because I'm already familiar with them). I think it would be good to:
1) have some document that explains the general design and main (internal) functions of the module (e.g. a PEP);
2) make a review on rietveld (possibly only of the diff with re, to limit the review to the new code only), so that people can ask questions, discuss and understand the code;
3) possibly update the document/PEP with the outcome of the rietveld review(s) and/or address the issues discussed (if any);
4) add documentation for the module and the (public) functions in Doc/library (this should be done anyway).
This will ensure that the general quality of the code is good, and when someone actually has to work on the code, there's enough documentation to make it possible. Best Regards, Ezio Melotti

On Sat, Aug 27, 2011 at 8:59 PM, Ezio Melotti <ezio.melotti@gmail.com> wrote:
Wasn't me. I've long given up expecting to understand every line of code in CPython. I'm happy if the code is written in a way that makes it possible to read and understand it as the need arises.
Are you volunteering? (Even if you don't want to be the only maintainer, it still sounds like you'd be a good co-maintainer of the regex module.)
I don't think that such a document needs to be a PEP; PEPs are usually intended for cases where significant discussion is expected, not just to explain things. A README file or a Wiki page would be fine, as long as it's sufficiently comprehensive.
That would be an interesting exercise indeed.
3) possibly update the document/PEP with the outcome of the rietveld review(s) and/or address the issues discussed (if any);
Yeah, of course.
4) add documentation for the module and the (public) functions in Doc/library (this should be done anyway).
Does regex have a significant public C interface? (_sre.c doesn't.) Does it have a Python-level interface beyond what re.py offers (apart from the obvious new flags and new regex syntax/semantics)?
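[One example of a Python-level addition listed in the regex docs at the time: match objects gain a captures() method that returns every capture of a repeated group, where re only keeps the last one -- a sketch assuming the regex package is installed:]

    import re
    import regex

    m_old = re.match(r"(\w+,?)+", "one,two,three")
    print(m_old.group(1))        # re remembers only the last repetition

    m_new = regex.match(r"(\w+,?)+", "one,two,three")
    print(m_new.captures(1))     # regex keeps every capture of group 1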
That sounds like a good description of a process that could lead to acceptance of regex as a re replacement.
Another "interesting" question is whether it's easy to port to the PEP 393 string representation, if it gets accepted.
It's very likely that PEP 393 will be accepted. So likely, in fact, that I would recommend that you start porting regex to PEP 393 now. The experience would benefit both your understanding of the regex module and the quality of the PEP and its implementation. I like what I hear here! -- --Guido van Rossum (python.org/~guido)

On Sun, Aug 28, 2011 at 2:28 PM, Guido van Rossum <guido@python.org> wrote:
timsort.txt and dictnotes.txt may be useful precedents for the kind of thing that is useful on that front. IIRC, the pymalloc stuff has a massive embedded comment, which can also work. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Aug 28, 2011 at 7:28 AM, Guido van Rossum <guido@python.org> wrote:
My name is listed in the experts index for 're' [0], and that should make me already "co-maintainer" for the module.
I don't think it does. Explaining the new syntax/semantics is useful for developers (e.g. what \p and \X are supposed to match), but also for users, so it's fine to have this documented in Doc/library/re.rst (and I don't think it's necessary to duplicate it in the README/PEP/Wiki).
So if we want to get this done I think we need Matthew for 1) (unless someone else wants to do it and have him review the result). If making a diff with the current re is doable and makes sense, we can use the rietveld instance on the bug tracker to make the review for 2). The same could be done with a diff that replaces the whole module though. 3) will follow after 2), and 4) is not difficult and can be done when we actually replace re (it's probably enough to reorganize a bit and convert to rst the page on PyPI). Best Regards, Ezio Melotti [0]: http://docs.python.org/devguide/experts.html#stdlib

Am 27.08.2011 12:10, schrieb Antoine Pitrou:
Well, the reviewer would also have to dive into the code details, e.g. through Rietveld. Of course, referencing the Rietveld issue in the PEP might be appropriate. A PEP should IMO only cover end-user aspects of the new re module. Code organization is typically not in the PEP. To give a specific example: you mentioned that there is (near) code duplication in MRAB's module. As a reviewer, I would discuss whether this can be eliminated - but not in the PEP. Regards, Martin

Guido van Rossum wrote:
Why not simply add the new lib, see whether it works out and then decide which path to follow. We've done that with the old regex lib. It took a few years and releases to have people port their applications to the then new re module and syntax, but in the end it worked. With a new regex library there are likely going to be quite a few subtle differences between re and regex - even if it's just doing things in a more Unicode compatible way. I don't think anyone can actually list all the differences given the complex nature of regular expressions, so people will likely need a few years and releases to get used it before a switch can be made. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 27 2011)
2011-10-04: PyCon DE 2011, Leipzig, Germany 38 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On Fri, Aug 26, 2011 at 3:09 PM, M.-A. Lemburg <mal@egenix.com> wrote:
I can't say I liked how that transition was handled last time around. I really don't want to have to tell people "Oh, that bug is fixed but you have to use regex instead of re" and then a few years later have to tell them "Oh, we're deprecating regex, you should just use re". I'm really hoping someone has more actual technical understanding of re vs. regex and can give us some facts about the differences, rather than, frankly, FUD. -- --Guido van Rossum (python.org/~guido)

On Fri, 26 Aug 2011 15:18:35 -0700 Guido van Rossum <guido@python.org> wrote:
The best way would be to contact the author, Matthew Barnett, or to ask on the tracker on http://bugs.python.org/issue2636. He has been quite willing to answer such questions in the past, AFAIR. Regards Antoine.

On Fri, Aug 26, 2011 at 3:33 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I had added him to the beginning of this thread but someone took him off.
So, that issue is about something called "regexp". AFAIK Matthew (MRAB) wrote something called "regex" (http://pypi.python.org/pypi/regex). Are they two different things??? -- --Guido van Rossum (python.org/~guido)

On Fri, 26 Aug 2011 15:47:21 -0700 Guido van Rossum <guido@python.org> wrote:
No, it's the same. The source is at https://code.google.com/p/mrab-regex-hg/, btw. Regards Antoine.

Guido van Rossum wrote:
No, you tell them: "If you want Unicode 6 semantics, use regex, if you're fine with Unicode 2.0/3.0 semantics, use re". After all, it's not like re suddenly stopped working :-)
The good part is that it's based on the re code, the FUD comes from the fact that the new lib is 380kB larger than the old one and that's not even counting the generated 500kB of lookup tables. If no one steps up to do a review or analysis, I think the only practical way to test the lib is to give it a prominent chance to prove itself. The other aspect is maintenance. Perhaps we could have a summer of code student do a review and analysis to get familiar with the code and then have at least two developers know the code well enough to support it for a while. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Aug 27 2011)
2011-10-04: PyCon DE 2011, Leipzig, Germany 38 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

On Sat, 27 Aug 2011 01:00:31 +0200 "M.-A. Lemburg" <mal@egenix.com> wrote:
It has a whole lot of new features in addition to better unicode support. See for yourself: https://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails
I'm not sure a GSoC student would be the best candidate to do a review matching our expectations. Regards Antoine.

"M.-A. Lemburg" <mal@egenix.com> writes:
Guido van Rossum wrote:
What do we say, then, to those who are unaware of the different semantics between those versions of Unicode, and want regular expression to “just work” in Python? To which document can we direct them to understand what semantics they want?
After all, it's not like re suddenly stopped working :-)
For some value of “working”, that is. The trick is to know whether that value is what one wants. -- \ “The fact of your own existence is the most astonishing fact | `\ you'll ever have to confront. Don't dare ever see your life as | _o__) boring, monotonous, or joyless.” —Richard Dawkins, 2010-03-10 | Ben Finney

Ben Finney wrote:
"M.-A. Lemburg" <mal@egenix.com> writes:
Presumably, like all modules, both the re and the regex module will have their own individual pages in the library reference. As the newcomer, regex should include a discussion of differences between the two. This can then be quietly dropped once re becomes formally deprecated. (Assuming that the std lib keeps re and regex in parallel for a few releases, which is not a given.) However, I note that last time, the old regex module was just documented as obsolete with little detailed discussion of the differences: http://docs.python.org/release/1.5/lib/node69.html#SECTION005300000000000000... -- Steven

Steven D'Aprano <steve@pearwood.info> writes:
My question is directed more to M-A Lemburg's passage above, and its implicit assumption that the user understand the changes between “Unicode 2.0/3.0 semantics” and “Unicode 6 semantics”, and how their own needs relate to those semantics. For programmers who know they want to follow Unicode conventions in Python, but don't know the distinction M-A Lemburg is drawing, to which document does he recommend we direct them? “The Unicode specification document in its various versions” isn't a feasible answer. -- \ “Computers are useless. They can only give you answers.” —Pablo | `\ Picasso | _o__) | Ben Finney

Ben Finney wrote:
I can only repeat my answer: the docs for the new regex module should include a discussion of the differences. If that requires summarising the differences that M-A Lemburg refers to, then so be it.
“The Unicode specification document in its various versions” isn't a feasible answer.
Presumably the Unicode spec will be the canonical source, but I agree that we should not expect people to read that in order to make a decision between re and regex. -- Steven

On Fri, Aug 26, 2011 at 2:45 PM, Guido van Rossum <guido@python.org> wrote:
...but on second thought I wonder if maybe regex is mature enough to replace re in Python 3.3.
I agree that the move from regex to re was kind of painful. It seems someone should merge the unit tests for re and regex, and apply the merged result to each for the sake of comparison. There might also be a need to expand the merged result to include new things. Then there probably should be a from __future__ import for a while.

On Fri, 26 Aug 2011 15:48:42 -0700 Dan Stromberg <drsalists@gmail.com> wrote:
Then there probably should be a from __future__ import for a while.
If you are willing to use a "from __future__ import", why not simply import regex as re ? We're not Perl, we don't have built-in syntactic support for regular expressions. Regards Antoine.

On Fri, Aug 26, 2011 at 5:08 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
If you add regex as "import regex", and the new regex module doesn't work out, regex might be harder to get rid of. from __future__ import is an established way of trying something for a while to see if it's going to work. EG: "from __future__ import re", where re is really the new module. But whatever.

On Fri, 26 Aug 2011 17:25:56 -0700 Dan Stromberg <drsalists@gmail.com> wrote:
That's an interesting idea. This way, integrating the new module would be a less risky move, since if it gives us too many problems, we could back out our decision in the next feature release. Regards Antoine.

Antoine Pitrou wrote:
I'm not sure that's correct. If there are differences in either the interface or the behaviour between the new regex and re, then reverting will be a pain regardless of whether you have: from __future__ import re re.compile(...) or import regex regex.compile(...) Either way, if the new regex library goes away, code will break, and fixing it may not be easy. It's not likely to be so easy that merely deleting the "from __future__ ..." line will do it, but if it is that easy, then using "import re as regex" will be just as easy. Have then been any __future__ features that were added provisionally? I can't think of any. That's not what __future__ is for, at least according to PEP 236. http://www.python.org/dev/peps/pep-0236/ I can't think of any __future__ feature that could be easily reverted once people start relying on it. Either syntax would break, or behaviour would change. The PEP even explicitly states that __future__ should not be used for changes which are backward compatible: Note that there is no need to involve the future_statement machinery in new features unless they can break existing code; fully backward- compatible additions can-- and should --be introduced without a corresponding future_statement. I wasn't around for the move from 1.4 regex to 1.5 re, so I don't know what was done poorly last time. But I can't see why we should treat regular expressions so differently from (say) argparse and optparse. from __future__ import optparse No. Just... no. -- Steven

On Fri, Aug 26, 2011 at 8:47 PM, Steven D'Aprano <steve@pearwood.info>wrote:
You're talking technically, which is important, but wasn't what I was suggesting would be helped. Politically, and from a marketing standpoint, it's easier to withdraw a feature you've given with a "Play with this, see if it works for you" warning. Have then been any __future__ features that were added provisionally?
I can't either, but ISTR hearing that from __future__ import was started with such an intent. Irrespective, it's hard to import something from "future" without at least suspecting that you're on the bleeding edge.

I can't either, but ISTR hearing that from __future__ import was started with such an intent.
No, not at all. The original intention was to enable features that would definitely would be added, not just right now. Tim Peters always objected to claims that future imports were talking about provisional features.
We don't want to add features to Python that we may have to withdraw. If there is doubt whether they should be added, they shouldn't be added. If they do get added, we have to live with it (until, say, Python 4, where bad features can be removed again). Regards, Martin

On Sat, Aug 27, 2011 at 4:01 PM, Dan Stromberg <drsalists@gmail.com> wrote:
The standard library isn't for playing. "pip install regex" is for playing. If we aren't sure we want to make the transition, then it doesn't go in. However, to my mind, reviewing and incorporating regex is a far more feasible model than trying to enhance the existing re module with a comparable feature set. At the moment, there's already an obvious way to get enhanced regex support in Python: install regex and use it instead of the standard library's re module. That's enough to pretty much kill any motivation anyone might have to make major changes to re itself. We're at least getting one thing right this time that we got wrong with multiprocessing, though - we're much, much further out from the 3.3 release than we were from the 2.6 release when multiprocessing was added to the standard library :) The next step needed is for someone to volunteer to write and champion a PEP that: - articulates the deficiencies in the current re module (the regex docs already cover some of this, as do Tom Christiansen's notes on the issue tracker) - explains why upgrading re in place is not feasible (e.g. noting that the availability of regex really limits the desire for anyone to reinvent that particular wheel, so even things that are theoretically possible may be highly unlikely in practice) - proposes a transition plan (personally, I'd be fine with an optparse -> argparse style transition where re remains around indefinitely to support legacy code, but new users are pointed towards regex. But depending on compatibility details, merging the two APIs in the existing re namespace may also be feasible) - proposes a maintenance strategy (I don't know how much Matthew has written regarding internal design details, but that kind of thing could really help. Matthew agreeing to continue maintenance as part of the standard library would also help a great deal, but wouldn't be enough on its own - while it's good for modules to have active maintainers to make the final call associated design decisions, it's potentially problematic when other core developers don't understand what the code is doing well enough to fix bugs in it) - confirms that the regex test suite can be incorporated cleanly into the standard library regression test suite (the difficulty of this was something that was underestimated for the inclusion of multiprocessing. Test suite integration is also the final sticking point holding up the PEP 380 'yield from' patch, although that's close to being resolved following the PyConAU sprints) - document tests conducted (e.g. micro-benchmark results, fusil results) PEP 371 (addition of multiprocessing), PEP 389 (addition of argparse) and Jesse's reflections on the way multiprocessing was added (http://jessenoller.com/2009/01/28/multiprocessing-in-hindsight/) are well worth reading for anyone considering stepping up to write a PEP. That last also highlights why even Matthew's support, however capably he has handled maintenance of regex as an independent project, wouldn't be enough - we had Richard Oudkerk's support and agreement to continue maintenance as the original author of multiprocessing, but he became unavailable early in the integration process. If Jesse hadn't been able to take up most of that slack, the likely result would have been reversion of the changes and removal of multiprocessing from the 2.6 release. 
Writing PEPs can be quite a frustrating experience (since a lot of feedback will be negative as people try to poke holes in the idea to see if it stands up to close scrutiny), but it's also really satisfying and rewarding if they end up getting accepted and incorporated :)
No, we make an explicit guarantee that future imports will never go away once they've been added. They may become redundant, but they won't break. There's no provision in the future mechanism for changes that are added and then later removed (see http://docs.python.org/dev/library/__future__). They're strictly for cases where backwards incompatibilities (usually, but not always, new keywords) may break existing code. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Nick Coghlan wrote:
The next step needed is for someone to volunteer to write and champion a PEP that:
Would it be feasible and desirable to modify regex so that it *is* backwards-compatible with re, with a view to making it a drop-in replacement at some point? If not, the PEP should discuss this also. -- Greg

On 8/27/2011 7:39 PM, Greg Ewing wrote:
Many of the things regex does differently might be called either bug fixes or feature changes, depending on one's viewpoint. Regex should definitely not be 'bug-compatible'. I think regex should be unicode-standard compliant as much as possible, and let the chips fall where they may. If so, it would be like the decimal module, which closely tracks the IEEE decimal standard, rather than the binary float standard. Regex is already much more compliant than re, as shown by Tom Christiansen. This is pretty obviously intentional on MB's part. It is also probably intentional that re *not* match today's Unicode TR18 specifications. These are reasons why both Ezio and I suggested on the tracker adding regex without deleting re. (I personally would not mind just replacing re with regex, but then I have no legacy re code to break. So I am not suggesting that out of respect for those who do.) -- Terry Jan Reedy

On Sun, Aug 28, 2011 at 3:48 AM, Terry Reedy <tjreedy@udel.edu> wrote:
I would actually prefer to replace re. Before doing that we should make a list of all the differences between the two modules (possibly in the PEP). On the regex page on PyPI there's already a list that can be used for this purpose [0]. For bug fixes it *shouldn't* be a problem if the behavior changes. New features shouldn't bring any backward-incompatible behavioral changes, and, as far as I understand, Matthew introduced the NEW flag [1], to avoid problems when they do. I think re should be kept around only if there are too many incompatibilities left and if they can't be fixed in regex. Best Regards, Ezio Melotti [0]: http://pypi.python.org/pypi/regex/0.1.20110717 [1]: "The NEW flag turns on the new behaviour of this module, which can differ from that of the 're' module, such as splitting on zero-width matches, inline flags affecting only what follows, and being able to turn inline flags off."

On Sat, Aug 27, 2011 at 5:48 PM, Terry Reedy <tjreedy@udel.edu> wrote:
Well, as you said, it depends on one's viewpoint. If there's a bug in the treatment of non-BMP character ranges, that's a bug, and fixing it shouldn't break anybody's code (unless it was worth breaking :-). But if there's a change that e.g. (hypothetical example) makes a different choice about how empty matches are treated in some edge case, and the old behavior was properly documented, that's a feature change, and I'd rather introduce a flag to select the new behavior (or, if we have to, a flag to preserve the old behavior, if the new behavior is really considered much better and much more useful).
I think regex should be unicode-standard compliant as much as possible, and let the chips fall where they may.
In most cases the Unicode improvements in regex are not where it is incompatible; e.g. adding \X and named ranges are fine new additions and IIUC the syntax was carefully designed not to introduce any incompatibilities (within the limitations of \-escapes). It's the many other "improvements" to the regex module that sometimes make it incompatible.There's a comprehensive list here: http://pypi.python.org/pypi/regex . Somebody should just go over it and for each difference make a recommendation for whether to treat this as a bugfix, a compatible new feature, or an incompatibility that requires some kind of flag. (We could have a single flag for all incompatibilities, or several flags.)
Well, I would hope that for each "major" Python version (i.e. 3.2, 3.3, 3.4, ...) we would pick a specific version of the Unicode standard and declare our desire to be compliant with that Unicode standard version, and not switch allegiances in some bugfix version (e.g. 3.2.3, 3.3.1, ...).
Regex is already much more compliant than re, as shown by Tom Christiansen.
Nobody disagrees with this or thinks it's a bad thing. :-)
This is pretty obviously intentional on MB's part.
That's also clear.
It is also probably intentional that re *not* match today's Unicode TR18 specifications.
That I'm not so sure of. I think it's more the case that TR18 evolved and that the re modules didn't -- probably mostly because nobody had the time and nobody was aware of the TR18 changes.
That option is definitely still on the table. At the very least a thorough review of the stated differences between re and regex should be done -- I trust that MR has been very thorough in his listing of those differences. The issues regarding maintenance and stability of MR's code can be solved in a number of ways -- if MR doesn't mind I would certainly be willing to give him core committer access (though I'd still recommend that he use his time primarily to train others in maintaining this important code base). -- --Guido van Rossum (python.org/~guido)

On 8/27/2011 11:54 PM, Guido van Rossum wrote:
Definitely. The unicode version would have to be frozen with beta 1 if not before. (I am quite sure the decimal module also freezes the IEEE standard version *it* follows for each Python version.) In my view, x.y is a version of the Python language while the x.y.z CPython releases are progressively better implementations of that one language, starting with x.y.0. This is the main reason I suggested that the first CPython release for the 3.3 language be called 3.3.0, as it now is. In this view, there is no question of an x.y.z+1 release changing the definition of the x.y language. -- Terry Jan Reedy

On Fri, Aug 26, 2011 at 11:01 PM, Dan Stromberg <drsalists@gmail.com> wrote: [Steven]
No, this was not the intent of __future__. The intent is that a feature is desirable but also backwards incompatible (e.g. introduces a new keyword) so that for 1 (sometimes more) releases we require the users to use the __future__ import. There was never any intent to use __future__ for experimental features. If we want that maybe we could have from __experimental__ import <whatever>. -- --Guido van Rossum (python.org/~guido)

On Sat, Aug 27, 2011 at 9:53 AM, Brian Curtin <brian.curtin@gmail.com>wrote:
I disagree. The first paragraph says this has something to do with new keywords. It doesn't appear to say what we expect users to -do- with it. Both are important. Is it "You'd better try this, because it's going in eventually. If you don't try it out before it becomes default behavior, you have no right to complain"? And if people do complain, what are python-dev's options?

On 2011-08-27, at 2:20 PM, Dan Stromberg wrote:
__future__ imports have nothing to do with "trying stuff before it comes"; they have to do with backward compatibility. For example, the "with_statement" was a __future__ import because introducing the "with" keyword would break any code using "with" as a token. I don't think that the goal of introducing "with" as a future import was "we're gonna see how it pans out, and decide if we really introduce it later". __future__ means "It's coming, prepare your code".
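For illustration, this is roughly what that looked like in practice on Python 2.5 (a minimal sketch; the filename is just a placeholder):

    from __future__ import with_statement  # makes 'with' a keyword before Python 2.6

    with open("example.txt", "w") as f:    # 'example.txt' is a placeholder
        f.write("hello\n")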

Well, users can use the new features...
No. It's "we have that feature which will be activated in a future version. If you want to use it today, use the __future__ import. If you don't want to use it (now or in the future), just don't."
And if people do complain, what are python-dev's options?
That will depend on the complaint. If it's "I don't like the new feature", then the obvious response is "don't use it, then". Regards, Martin

Dan Stromberg wrote:
Have you read the PEP? I found it very helpful. http://www.python.org/dev/peps/pep-0236/ The motivation given in the first paragraph is pretty clear to me: __future__ is machinery added to Python to aid the transition when a backwards incompatible change is made. Perhaps it needs a note stating explicitly that it is not for trying out new features which may or may not be added at a later date. That may help prevent confusion in the, er, future. [...]
And if people do complain, what are python-dev's options?
The PEP includes a question very similar to that: Q: Going back to the nested_scopes example, what if release 2.2 comes along and I still haven't changed my code? How can I keep the 2.1 behavior then? A: By continuing to use 2.1, and not moving to 2.2 until you do change your code. The purpose of future_statement is to make life easier for people who keep current with the latest release in a timely fashion. We don't hate you if you don't, but your problems are much harder to solve, and somebody with those problems will need to write a PEP addressing them. future_statement is aimed at a different audience. To me, it's quite clear: once a feature change hits __future__, it is already part of the language. It may be an optional part for at least one release, but removing it again will require the same deprecation process as removing any other language feature (see PEP 5 for more details). -- Steven

On Aug 26, 2011, at 05:25 PM, Dan Stromberg wrote:
from __future__ import is an established way of trying something for a while to see if it's going to work.
Actually, no. The documentation says: -----snip snip----- __future__ is a real module, and serves three purposes: * To avoid confusing existing tools that analyze import statements and expect to find the modules they’re importing. * To ensure that future statements run under releases prior to 2.1 at least yield runtime exceptions (the import of __future__ will fail, because there was no module of that name prior to 2.1). * To document when incompatible changes were introduced, and when they will be — or were — made mandatory. This is a form of executable documentation, and can be inspected programmatically via importing __future__ and examining its contents. -----snip snip----- So, really the __future__ module is a way to introduce accepted but incompatible changes in a controlled way, through successive releases. It's never been used to introduce experimental features that might be removed if they don't work out. Cheers, -Barry
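A short sketch of the third point above (the "executable documentation" aspect): each feature object records when it became optional and when it becomes, or became, mandatory.

    import __future__

    feature = __future__.division
    print(feature.getOptionalRelease())   # (2, 2, 0, 'alpha', 2)
    print(feature.getMandatoryRelease())  # (3, 0, 0, 'alpha', 0)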

However, I don't know much about regex
The problem really is: nobody does (except for Matthew Barnett probably). This means that this contribution might be stuck "forever": somebody would have to review the module, identify issues, approve it, and take the blame if something breaks. That takes considerable time and has a considerable risk, for little expected glory - so nobody has volunteered to mentor/manage integration of that code. I believe most core contributors (who have run into this code) consider it worthwhile, but are just too scared to take action. Among us, some are more "regex gurus" than others; you know who you are. I guess the PSF would pay for the review, if that is what it would take. Regards, Martin

On Sat, Aug 27, 2011 at 1:57 AM, Guido van Rossum <guido@python.org> wrote:
Matthew has always been responsive on the tracker, usually fixing reported bugs in a matter of days, and I think he's willing to keep doing so once the regex module is included. Even if I haven't yet tried the module myself (I'm planning to do it though), it seems quite popular out there (the download number on PyPI apparently gets reset for each new release, so I don't know the exact total), and apparently people are already using it as a replacement of re. I'm not sure it's worth doing an extensive review of the code, a better approach might be to require extensive test coverage (and a review of tests). If the code seems well written, commented, documented (I think proper rst documentation is still missing), and tested (both with unittest and out in the wild), and Matthew is willing to maintain it, I think we can include it. We will get familiar with the code once we start contributing to it and fixing bugs, as it already happens with most of the other modules. See also the "New regex module for 3.2?" thread ( http://mail.python.org/pipermail/python-dev/2010-July/101606.html ). Best Regards, Ezio Melotti
-- --Guido van Rossum (python.org/~guido)

On Sat, 27 Aug 2011 04:37:21 +0300 Ezio Melotti <ezio.melotti@gmail.com> wrote:
Isn't this precisely what a review is supposed to assess?
We will get familiar with the code once we start contributing to it and fixing bugs, as it already happens with most of the other modules.
I'm not sure it's a good idea for a module with more than 10000 lines of C code (and 4000 lines of pure Python code). This is several times the size of multiprocessing. The C code looks very cleanly written, but it's still a big chunk of algorithmically sophisticated code. Another "interesting" question is whether it's easy to port to the PEP 393 string representation, if it gets accepted. Regards Antoine.

Am 27.08.2011 08:33, schrieb Terry Reedy:
That's a quality-of-implementation issue (in both cases). In principle, the modules should continue to work unmodified, and indeed SRE does. However, the module will then match on Py_UNICODE, which may be expensive to produce, and may not meet your expectations of surrogate pair handling. So realistically, the module should be ported, which has the challenge that matching needs to operate on three different representations. The modules already support two representations (unsigned char and Py_UNICODE), but probably switching on type, not on state. Regards, Martin

On Sat, 27 Aug 2011 09:18:14 +0200 "Martin v. Löwis" <martin@v.loewis.de> wrote:
From what I've seen, re generates two different sets of functions at compile-time (with a stringlib-like approach), while regex has a run-time flag to choose between the two representations (where, interestingly, the two code paths are explicitly spelled out, almost duplicates of each other). Matthew, please correct me if I'm wrong. Regards Antoine.

On Sat, Aug 27, 2011 at 4:56 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
This can be done without actually knowing and understanding every single function in the module (I got the impression that someone wants this kind of review, correct me if I'm wrong).
Even unicodeobject.c is 10k+ lines of C code and I got familiar with (parts of) it just by fixing bugs in specific functions. I took a look at the regex code and it seems clear, with enough comments and several small functions that are easy to follow and understand. multiprocessing requires good knowledge of a number of concepts and platform-specific issues that make it more difficult to understand and maintain (but maybe regex-related concepts seem easier to me because I'm already familiar with them). I think it would be good to:
1) have some document that explains the general design and main (internal) functions of the module (e.g. a PEP);
2) make a review on rietveld (possibly only of the diff with re, to limit the review to the new code only), so that people can ask questions, discuss and understand the code;
3) possibly update the document/PEP with the outcome of the rietveld review(s) and/or address the issues discussed (if any);
4) add documentation for the module and the (public) functions in Doc/library (this should be done anyway).
This will ensure that the general quality of the code is good, and when someone actually has to work on the code, there's enough documentation to make it possible. Best Regards, Ezio Melotti

On Sat, Aug 27, 2011 at 8:59 PM, Ezio Melotti <ezio.melotti@gmail.com> wrote:
Wasn't me. I've long given up expecting to understand every line of code in CPython. I'm happy if the code is written in a way that makes it possible to read and understand it as the need arises.
Are you volunteering? (Even if you don't want to be the only maintainer, it still sounds like you'd be a good co-maintainer of the regex module.)
I don't think that such a document needs to be a PEP; PEPs are usually intended where there is significant discussion expected, not just to explain things. A README file or a Wiki page would be fine, as long as it's sufficiently comprehensive.
That would be an interesting exercise indeed.
3) possibly update the document/PEP with the outcome of the rietveld review(s) and/or address the issues discussed (if any);
Yeah, of course.
4) add documentation for the module and the (public) functions in Doc/library (this should be done anyway).
Does regex have a significant public C interface? (_sre.c doesn't.) Does it have a Python-level interface beyond what re.py offers (apart from the obvious new flags and new regex syntax/semantics)?
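For readers who haven't tried the module, one example of the kind of Python-level extension being asked about (hedged; taken from the feature list on the regex PyPI page) is the captures() method on match objects:

    import regex

    m = regex.match(r"(?:(\w+)\s)+", "one two three ")
    print(m.group(1))     # 'three' -- only the last capture, as re would also report
    print(m.captures(1))  # ['one', 'two', 'three'] -- all captures; re has no equivalent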
That sounds like a good description of a process that could lead to acceptance of regex as a re replacement.
Another "interesting" question is whether it's easy to port to the PEP 393 string representation, if it gets accepted.
It's very likely that PEP 393 is accepted. So likely, in fact, that I would recommend that you start porting regex to PEP 393 now. The experience would benefit both your understanding of the regex module and the quality of the PEP and its implementation. I like what I hear here! -- --Guido van Rossum (python.org/~guido)
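A small sketch of why the PEP 393 port matters for a matching engine: on a PEP 393 build (3.3+), a string is stored in 1-, 2- or 4-byte units depending on its widest character, so the engine has to be prepared to walk any of the three layouts.

    import sys

    # Strings of different "width" use different per-character storage
    # under PEP 393, as the total object sizes hint at.
    for s in ("ascii only", "caf\u00e9", "emoji \U0001F600"):
        print(repr(s), sys.getsizeof(s), "bytes")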

On Sun, Aug 28, 2011 at 2:28 PM, Guido van Rossum <guido@python.org> wrote:
timsort.txt and dictnotes.txt may be useful precedents for the kind of thing that is useful on that front. IIRC, the pymalloc stuff has a massive embedded comment, which can also work. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sun, Aug 28, 2011 at 7:28 AM, Guido van Rossum <guido@python.org> wrote:
My name is listed in the experts index for 're' [0], and that should already make me a "co-maintainer" of the module.
I don't think it does. Explaining the new syntax/semantics is useful for developers (e.g. what \p and \X are supposed to match), but also for users, so it's fine to have this documented in Doc/library/re.rst (and I don't think it's necessary to duplicate it in the README/PEP/Wiki).
So if we want to get this done I think we need Matthew for 1) (unless someone else wants to do it and have him review the result). If making a diff with the current re is doable and makes sense, we can use the rietveld instance on the bug tracker to make the review for 2). The same could be done with a diff that replaces the whole module though. 3) will follow after 2), and 4) is not difficult and can be done when we actually replace re (it's probably enough to reorganize the page on PyPI a bit and convert it to rst). Best Regards, Ezio Melotti [0]: http://docs.python.org/devguide/experts.html#stdlib

Am 27.08.2011 12:10, schrieb Antoine Pitrou:
Well, the reviewer would also have to dive into the code details, e.g. through Rietveld. Of course, referencing the Rietveld issue in the PEP might be appropriate. A PEP should IMO only cover end-user aspects of the new re module. Code organization is typically not in the PEP. To give a specific example: you mentioned that there is (near) code duplication in MRAB's module. As a reviewer, I would discuss whether this can be eliminated - but not in the PEP. Regards, Martin
participants (17)
- "Martin v. Löwis"
- Antoine Pitrou
- Barry Warsaw
- Ben Finney
- Brian Curtin
- Dan Stromberg
- exarkun@twistedmatrix.com
- Ezio Melotti
- Greg Ewing
- Guido van Rossum
- M.-A. Lemburg
- MRAB
- Nick Coghlan
- Steven D'Aprano
- Terry Reedy
- Tom Christiansen
- Virgil Dupras