I trust your instincts and powers of analysis here. Maybe MRAB has some useful feedback on the tar in the honey?

On Tue, Nov 28, 2017 at 12:04 PM, Serhiy Storchaka <storchaka@gmail.com> wrote:
The two largest problems in the re module are splitting on zero-width patterns and complete and correct support of the Unicode standard. These problems are solved in regex. regex has many other features, but they are less important.

I want to describe the problem of splitting on zero-width patterns. It was already discussed on Python-Dev 13 years ago [3] and maybe later. See also issues: [4], [5], [6], [7], [8].

In short, it doesn't work. Splitting on the pattern r'\b' doesn't split the text at word boundaries, and splitting on the pattern r'\s+|(?<=-)' splits the text on whitespace, but does not split words at hyphens as expected.

In Python 3.4 and earlier:

>>> re.split(r'\b', 'Self-Defence Class')
['Self-Defence Class']
>>> re.split(r'\s+|(?<=-)', 'Self-Defence Class')
['Self-Defence', 'Class']
>>> re.split(r'\s*', 'Self-Defence Class')
['Self-Defence', 'Class']

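Note that splitting on r'\s*' (0 or more whitespace characters) actually splits on r'\s+' (1 or more whitespace characters). Splitting on patterns that can only match the empty string (like r'\b' or r'(?<=-)') never worked, while splitting on patterns that can match both the empty string and non-empty strings only split on the non-empty matches.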

Since Python 3.5, splitting on a pattern that can only match the empty string raises a ValueError (this never worked), and splitting on a pattern that can match the empty string, but not only the empty string, emits a FutureWarning. This gave developers time to replace their r'\s*' patterns with r'\s+', as they should be.

Now I have created a final patch [9] that makes re.split() split on zero-width patterns.
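A minimal sketch of the replacement described above, assuming the intent was always to split on runs of whitespace. The helper name split_on_whitespace is hypothetical. Because r'\s+' cannot match the empty string, it should give the same result before and after the change.

```python
import re

# r'\s+' needs at least one whitespace character, so it can never
# produce a zero-width match.
WHITESPACE_RUN = re.compile(r'\s+')

def split_on_whitespace(text):
    """Split text on runs of whitespace (hypothetical helper)."""
    return WHITESPACE_RUN.split(text)

print(split_on_whitespace('Self-Defence Class'))
```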

>>> re.split(r'\b', 'Self-Defence Class')
['', 'Self', '-', 'Defence', ' ', 'Class', '']
>>> re.split(r'\s+|(?<=-)', 'Self-Defence Class')
['Self-', 'Defence', 'Class']
>>> re.split(r'\s*', 'Self-Defence Class')
['', 'S', 'e', 'l', 'f', '-', 'D', 'e', 'f', 'e', 'n', 'c', 'e', 'C', 'l', 'a', 's', 's', '']

In the latter case the result differs too much from the previous result, and this is likely not what the author wanted to get. But users had two Python releases to fix their code. FutureWarning is not silenced by default.

Because these patterns produced errors or warnings in the recent two releases, we don't need an additional parameter for compatibility.
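As a rough illustration of zero-width splitting, the sketch below wraps two of the patterns from this message in helpers. The function names are hypothetical, and the code assumes an interpreter that already includes this change (Python 3.7 or later); on 3.5 and 3.6 the r'\b' call raises ValueError instead.

```python
import re

def split_at_word_boundaries(text):
    # r'\b' can only ever match the empty string, so every split
    # point here is a zero-width match.
    return re.split(r'\b', text)

def split_on_space_or_after_hyphen(text):
    # Either consume a run of whitespace, or split at the zero-width
    # position just after a hyphen (the hyphen itself is kept).
    return re.split(r'\s+|(?<=-)', text)

print(split_at_word_boundaries('Self-Defence Class'))
print(split_on_space_or_after_hyphen('Self-Defence Class'))
```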

But the problem was not just with re.split(). Other functions also did not work well with patterns that can match the empty string.

>>> re.findall(r'^|\w+', 'Self-Defence Class')
['', 'elf', 'Defence', 'Class']
>>> list(re.finditer(r'^|\w+', 'Self-Defence Class'))
[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 4), match='elf'>, <re.Match object; span=(5, 12), match='Defence'>, <re.Match object; span=(13, 18), match='Class'>]
>>> re.sub(r'(^|\w+)', r'<\1>', 'Self-Defence Class')
'<>S<elf>-<Defence> <Class>'

After matching the empty string, the following character is skipped and is not included in the next match. My patch fixes these functions too.

>>> re.findall(r'^|\w+', 'Self-Defence Class')
['', 'Self', 'Defence', 'Class']
>>> list(re.finditer(r'^|\w+', 'Self-Defence Class'))
[<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(0, 4), match='Self'>, <re.Match object; span=(5, 12), match='Defence'>, <re.Match object; span=(13, 18), match='Class'>]
>>> re.sub(r'(^|\w+)', r'<\1>', 'Self-Defence Class')
'<><Self>-<Defence> <Class>'

I think this change doesn't need preliminary warnings, because it changes the behavior of more rarely used patterns. No re tests were broken. I needed to add new tests to detect the behavior change.
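A check that detects this behavior change could look roughly like the sketch below. It is not taken from the patch; the variable names are hypothetical, and the expected values assume an interpreter with the fix applied (Python 3.7 or later).

```python
import re

TEXT = 'Self-Defence Class'
PATTERN = r'^|\w+'

# After the empty match at position 0, the next match may start at
# position 0 too, so the leading 'S' is no longer skipped.
matches = re.findall(PATTERN, TEXT)
spans = [m.span() for m in re.finditer(PATTERN, TEXT)]
marked = re.sub(r'(^|\w+)', r'<\1>', TEXT)

print(matches)
print(spans)
print(marked)
```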

But there is one spoonful of tar in a barrel of honey. I didn't expect this, but this change has broken a pattern used with re.sub() in the doctest module: r'(?m)^\s*?$'. This was fixed by replacing it with r'(?m)^[^\S\n]+?$'. I hope that such cases are pretty rare and think this is an avoidable breakage.
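To make the replacement pattern concrete, here is a small sketch of how it behaves with re.sub(). The helper name and the sample text are hypothetical and are not the doctest code itself. The character class [^\S\n] means "whitespace other than a newline", and +? requires at least one character, so this pattern can never match the empty string and is unaffected by the zero-width changes.

```python
import re

# Matches a line consisting only of non-newline whitespace.
WHITESPACE_ONLY_LINE = re.compile(r'(?m)^[^\S\n]+?$')

def blank_out_whitespace_lines(text):
    """Empty every whitespace-only line, keeping the newlines."""
    return WHITESPACE_ONLY_LINE.sub('', text)

print(repr(blank_out_whitespace_lines('a\n   \n\nb')))
```

Lines that are already empty are left alone, and no newline is ever consumed.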

The new behavior of re.split() matches the behavior of regex.split() with the VERSION1 flag, and the new behavior of re.findall() and re.finditer() matches the behavior of the corresponding functions in the regex module (independently of the version flag). But the new behavior of re.sub() doesn't exactly match the behavior of regex.sub() with any version flag. It differs from the old behavior, as you can see in the example above, but is closer to it than to regex.sub() with VERSION1. This made it possible to avoid breaking existing tests for re.sub().

>>> regex.sub(r'(\W+|(?<=-))', r':', 'Self-Defence Class')
'Self:Defence:Class'
>>> regex.sub(r'(?V1)(\W+|(?<=-))', r':', 'Self-Defence Class')
'Self::Defence:Class'
>>> re.sub(r'(\W+|(?<=-))', r':', 'Self-Defence Class')
'Self:Defence:Class'

Like re.split(), it never matches the empty string adjacent to the previous match. re.findall() and re.finditer() decline to match the empty string only when it is adjacent to a previous empty-string match. In the regex module, regex.sub() is mutually consistent with regex.findall() and regex.finditer() (with the VERSION1 flag), but regex.split() is not consistent with them. In the re module, re.split() and re.sub() will be mutually consistent, as will re.findall() and re.finditer(). This is more backward compatible. And I don't know of any reasons for preferring the behavior of re.findall() and re.finditer() over the behavior of re.split() in this corner case.

It would be nice to get this change into 3.7.0a3 for wider testing. Please review the patch [9] or share your thoughts about this change.

[1] https://docs.python.org/3/library/re.html
[2] https://pypi.python.org/pypi/regex/
[3] https://mail.python.org/pipermail/python-dev/2004-August/047272.html
[4] https://bugs.python.org/issue852532
[5] https://bugs.python.org/issue988761
[6] https://bugs.python.org/issue1647489
[7] https://bugs.python.org/issue3262
[8] https://bugs.python.org/issue25054
[9] https://github.com/python/cpython/pull/4471



--
--Guido van Rossum (python.org/~guido)