[Python-Dev] Zero-width matching in regexes

Serhiy Storchaka storchaka at gmail.com
Wed Dec 13 10:26:11 EST 2017


05.12.17 01:21, MRAB wrote:
> I've finally come to a conclusion as to what the "correct" behaviour of 
> zero-width matches should be: """always return the first match, but 
> never a zero-width match that is joined to a previous zero-width match""".
> 
> If it's about to return a zero-width match that's joined to a previous 
> zero-width match, then backtrack and keep on looking for a match.
> 
> Example:
> 
>  >>> print([m.span() for m in re.finditer(r'|.', 'a')])
> [(0, 0), (0, 1), (1, 1)]
> 
> re.findall, re.split and re.sub should work accordingly.
> 
> If re.finditer finds n matches, then re.split should return a list of 
> n+1 strings and re.sub should make n replacements (excepting maxsplit, 
> etc.).
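
The rule quoted above can be checked with a short sketch. It takes the
pattern and string from the example and compares the number of matches
found by re.finditer() with what re.split() and re.subn() report. The
variable names are only illustrative, and the checks are expected to
print True only on an interpreter where all three functions already
follow the proposed rule (that is an assumption about the version being
run).

```python
import re

# Pattern and input taken from the quoted example.
pattern, text = r'|.', 'a'

# Every match that finditer() reports, as (start, end) spans.
spans = [m.span() for m in re.finditer(pattern, text)]

# split() should yield one more piece than there are matches.
pieces = re.split(pattern, text)

# subn() returns (new_string, number_of_replacements).
_, replacements = re.subn(pattern, '-', text)

print(spans)
print(len(pieces) == len(spans) + 1)
print(replacements == len(spans))
```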

We now have a good opportunity to change a long-standing behavior of 
re.sub(). Currently, empty matches are prohibited if they are adjacent 
to a previous match. For consistency with re.finditer() and 
re.findall(), with regex.sub() under the VERSION1 flag, and with Perl, 
PCRE and other engines, they should be prohibited only if adjacent to a 
previous *empty* match. Currently re.sub('x*', '-', 'abxc') returns 
'-a-b-c-', but it would return '-a-b--c-' if the behavior were changed.
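
A minimal sketch of where the extra '-' comes from: it lists the spans 
that re.finditer() reports for the same pattern and then runs re.sub(). 
The expected values in the comments assume an interpreter with the 
changed behavior (Python 3.7 or later, as far as I can tell); with the 
current behavior described above, re.sub() would give '-a-b-c-' instead.

```python
import re

text = 'abxc'

# Spans of all matches of 'x*'. The empty match at (3, 3) directly
# follows the non-empty match (2, 3) of 'x'.
spans = [m.span() for m in re.finditer('x*', text)]
print(spans)   # assumed: [(0, 0), (1, 1), (2, 3), (3, 3), (4, 4)]

# With the changed behavior, sub() replaces that empty match too,
# which produces the doubled '-' between 'b' and 'c'.
result = re.sub('x*', '-', text)
print(result)  # assumed, with the changed behavior: '-a-b--c-'
```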

This behavior was already changed once, unintentionally and 
temporarily, between 2.1 and 2.2, when the underlying implementation of 
re was changed from PCRE to SRE. But the former behavior was quickly 
restored (see https://bugs.python.org/issue462270). Ironically, the 
behavior of the current PCRE is different.

Possible options:

1. Change the behavior right now.
2. Start emitting a FutureWarning and change the behavior in a future version.
3. Keep the status quo forever.

We need to make a decision right now, since in the first two cases we 
should also change the behavior of re.split() right now. Its behavior 
is changing in 3.7 in any case, and it is better to change it once than 
to break it in two different releases.

The changed detail is so subtle that no regular expressions in the 
stdlib or its tests are affected, except the special-purpose test added 
to guard the current behavior.


