On re / regex replacement

Sun Aug 28 14:40:36 EDT 2011

On 28/08/2011 14:40, Vlastimil Brom wrote:
> 2011/8/28 jmfauth<wxjmfauth at gmail.com>:
>> There is actually a discussion on the dev-list about the replacement
>> of "re" by "regex".
>> ...
>> If I can understand the ASCII flag, ASCII being the "lingua franca" of
>> almost all codings, I am more skeptical about the LOCALE/UNICODE
>> flags.
>>
>> There is in my mind some kind of conflict here. What is 100% unicode
>> compliant should be locale independent ("Unicode.org") and a locale
>> dependency means a loss of unicode compliance.
>>
>> I'm fearing some potential problems here:  Users or modules working
>> in one mode, while some others are working in the other mode.
>>
>> ...
>> jmf
>
> As I understand it, regex was designed to be as compatible with re as
> possible; sometimes even behaviour that is problematic (in some
> interpretation) is retained as the default and "corrected" via the
> NEW flag (e.g. zero-width split). The LOCALE flag also seems to be
> considered a legacy feature and is kept with the same behaviour as in re;
> cf.: http://code.google.com/p/mrab-regex-hg/issues/detail?id=6&can=1
> In my opinion, the LOCALE flag is not reliable (in a way I would
> imagine) in either re or regex.
>
In Python 2, re defaults to ASCII and you must use UNICODE for Unicode
strings (the str type is a bytestring). In Python 3, re defaults to
UNICODE for str patterns (the str type is a Unicode string) and you use
ASCII to restrict matching to ASCII; bytes patterns are matched as ASCII.
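
A minimal sketch of the Python 3 side of this, assuming only the standard
re module; the sample text is made up for illustration:

```python
import re

text = "na\u00efve caf\u00e9"   # "naïve café"

# str pattern, default flags: \w matches any Unicode word character.
unicode_words = re.findall(r"\w+", text)

# str pattern with re.ASCII: \w is restricted to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

# bytes pattern: matching is ASCII-only unless LOCALE is given.
byte_words = re.findall(rb"\w+", text.encode("latin-1"))

print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
print(byte_words)     # [b'na', b've', b'caf']
```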

The LOCALE flag is for locale-dependent 8-bit bytestrings. It uses the
toupper and tolower functions of the underlying C library.
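
As a rough illustration, assuming Python 3 (where LOCALE is only accepted
for bytes patterns), the result for a non-ASCII byte depends on whatever C
locale is active, which is why the output below is not fixed:

```python
import re

# LOCALE makes \w, \b and case-insensitive matching follow the C locale.
pattern = re.compile(rb"\w+", re.LOCALE)

data = "caf\u00e9".encode("latin-1")   # b'caf\xe9'

# Whether byte 0xE9 counts as a word character depends on the current
# locale, so this is either [b'caf'] or [b'caf\xe9'].
words = pattern.findall(data)
print(words)

# ASCII letters behave the same in every locale.
same_case = re.fullmatch(rb"abc", b"ABC", re.IGNORECASE | re.LOCALE)
print(same_case is not None)   # True
```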

The regex module tries to be drop-in compatible. It supports the LOCALE
flag only because the re module has it. Even Perl has something similar.

> In the area of flags regex should work the same way as re, or it just
> adds more possibilities (REVERSE for backwards search, ASCII as the
> complement for unicode, NEW to enable some incompatible additions or
> corrections, where the original behaviour could be relied on).
>
> The only (understandable) incompatibilities I encounter in regex are the
> new features requiring special syntax, which would obviously raise
> errors in re or which would be matched literally instead.
> see
> http://code.google.com/p/mrab-regex-hg/wiki/GeneralDetails#Additional_features
> for an overview of the additions.
>
In the re module, unknown escape sequences are treated as literals, e.g.
\K is treated as K.

The regex module has more escape sequences, so that may break existing
regexes, e.g. \X isn't treated as X, but matches a grapheme. Unknown
escape sequences are still treated as literals, as in re.
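
One way to avoid depending on how an unknown escape is handled is to
escape literal text explicitly. A small sketch using the standard
re.escape; the sample text is hypothetical:

```python
import re

text = r"path\Key and Key"

# Build a pattern for a literal backslash followed by K, instead of
# writing \K and hoping it is (or is not) treated as an escape.
literal = re.escape(r"\K")

# Literal "\K" followed by the rest of the word.
found = re.findall(literal + r"\w+", text)
print(found)   # ['\\Key']

# To match a plain K, just write K with no backslash.
plain = re.findall(r"K\w+", text)
print(plain)   # ['Key', 'Key']
```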

My view is that you shouldn't be relying on that behaviour. If it looks
like an escape sequence, it may very well be one. It's like their use
in string literals for file paths on Windows. I would've preferred
that an invalid escape sequence in a string literal raised an exception
(either it's valid and has a meaning, or it's invalid/reserved for
future use).
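
The Windows path case can be seen in a short sketch; the path itself is
hypothetical:

```python
# \n and \t are valid escapes, so this is NOT the path it appears to be:
# it contains a newline and a tab.
plain_path = "C:\new\table.txt"

# A raw string keeps the backslashes as written.
raw_path = r"C:\new\table.txt"

print("\n" in plain_path, "\t" in plain_path)   # True True
print(len(plain_path), len(raw_path))           # 14 16
```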

It's a balancing act. Requiring the NEW flag for _any_ deviation from
re would be very annoying.

> Personally I am very happy with regex, both with its features as well
> as with the support and maintenance by its developer;
> however I am mostly using it for manually entered patterns, and less
> for hardcoded operation.
>
And I'm very happy with your feedback. ;-)