[Python-Dev] Should we move to replace re with regex?

Sun Aug 28 06:28:17 CEST 2011

On Sat, Aug 27, 2011 at 8:59 PM, Ezio Melotti <ezio.melotti at gmail.com> wrote:
> On Sat, Aug 27, 2011 at 4:56 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
>>
>> On Sat, 27 Aug 2011 04:37:21 +0300
>> Ezio Melotti <ezio.melotti at gmail.com> wrote:
>> >
>> > I'm not sure it's worth doing an extensive review of the code, a better
>> > approach might be to require extensive test coverage  (and a review of
>> > tests).  If the code seems well written, commented, documented (I think
>> > proper rst documentation is still missing),
>>
>> Isn't this precisely what a review is supposed to assess?
>
> This can be done without actually knowing and understanding every single
> function in the module (I got the impression that someone wants this kind of
> review, correct me if I'm wrong).

Wasn't me. I've long given up expecting to understand every line of
code in CPython. I'm happy if the code is written in a way that makes
it possible to read and understand it as the need arises.

>> > We will get familiar with the code once we start contributing
>> > to it and fixing bugs, as it already happens with most of the other
>> > modules.
>>
>> I'm not sure it's a good idea for a module with more than 10000 lines
>> of C code (and 4000 lines of pure Python code). This is several times
>> the size of multiprocessing. The C code looks very cleanly written, but
>> it's still a big chunk of algorithmically sophisticated code.
>
> Even unicodeobject.c is 10k+ lines of C code and I got familiar with (parts
> of) it just by fixing bugs in specific functions.
> I took a look at the regex code and it seems clear, with enough comments and
> several small functions that are easy to follow and understand.
> multiprocessing requires good knowledge of a number of concepts and
> platform-specific issues that makes it more difficult to understand and
> maintain (but maybe regex-related concepts seems easier to me because I'm
> already familiar with them).

Are you volunteering? (Even if you don't want to be the only
maintainer, it still sounds like you'd be a good co-maintainer of the
regex module.)

> I think it would be good to:
>   1) have some document that explains the general design and main (internal)
> functions of the module (e.g. a PEP);

I don't think that such a document needs to be a PEP; PEPs are usually
intended where there is significant discussion expected, not just to
explain things. A README file or a Wiki page would be fine, as long as
it's sufficiently comprehensive.

>   2) make a review on rietveld (possibly only of the diff with re, to limit
> the review to the new code only), so that people can ask questions, discuss
> and understand the code;

That would be an interesting exercise indeed.

>   3) possibly update the document/PEP with the outcome of the rietveld
> review(s) and/or address the issues discussed (if any);

Yeah, of course.

>   4) add documentation for the module and the (public) functions in
> Doc/library (this should be done anyway).

Does regex have a significany public C interface? (_sre.c doesn't.)
Does it have a Python-level interface beyond what re.py offers (apart
from the obvious new flags and new regex syntax/semantics)?

> This will ensure that the general quality of the code is good, and when
> someone actually has to work on the code, there's enough documentation to
> make it possible.

That sounds like a good description of a process that could lead to
acceptance of regex as a re replacement.

>> Another "interesting" question is whether it's easy to port to the PEP
>> 393 string representation, if it gets accepted.

It's very likely that PEP 393 is accepted. So likely, in fact, that I
would recommend that you start porting regex to PEP 393 now. The
experience would benefit both your understanding of the regex module
and the quality of the PEP and its implementation.

I like what I hear here!

-- 
--Guido van Rossum (python.org/~guido)