[Python-Dev] Regular expressions: splitting on zero-width patterns

Tue Nov 28 15:15:31 EST 2017

I trust your instincts and powers of analysis here. Maybe MRAB has some
useful feedback on the tar in the honey?

On Tue, Nov 28, 2017 at 12:04 PM, Serhiy Storchaka <storchaka at gmail.com>
wrote:

> The two largest problems in the re module are splitting on zero-width
> patterns and complete and correct support of the Unicode standard. These
> problems are solved in regex. regex has many other features, but they are
> less important.
>
> I want to describe the problem of splitting on zero-width patterns. It was
> already discussed on Python-Dev 13 years ago [3] and maybe later. See also
> issues: [4], [5], [6], [7], [8].
>
> In short, it doesn't work. Splitting on the pattern r'\b' doesn't split the
> text at word boundaries, and splitting on the pattern r'\s+|(?<=-)'
> splits the text on whitespace, but does not split words at hyphens as
> expected.
>
> In Python 3.4 and earlier:
>
> >>> re.split(r'\b', 'Self-Defence Class')
> ['Self-Defence Class']
> >>> re.split(r'\s+|(?<=-)', 'Self-Defence Class')
> ['Self-Defence', 'Class']
> >>> re.split(r'\s*', 'Self-Defence Class')
> ['Self-Defence', 'Class']
>
> Note that splitting on r'\s*' (0 or more whitespace characters) actually
> split on r'\s+' (1 or more whitespace characters). Splitting on patterns
> that can only match the empty string (like r'\b' or r'(?<=-)') never
> worked, while splitting
>
> Starting with Python 3.5, splitting on a pattern that can only match the
> empty string raises a ValueError (this never worked), and splitting on a
> pattern that can match both the empty string and non-empty strings emits a
> FutureWarning. This gave developers time to replace their patterns r'\s*'
> with r'\s+', as they should be.
>
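
A minimal sketch of the migration described above, using the sample string
from the quoted examples. It assumes the author only wants to split on runs
of whitespace; r'\s+' cannot match the empty string, so the result should
not depend on how re.split() treats zero-width matches.

```python
import re

text = 'Self-Defence Class'

# r'\s+' requires at least one whitespace character, so it can never
# produce an empty match and triggers neither the ValueError nor the
# FutureWarning mentioned above.
parts = re.split(r'\s+', text)
print(parts)
```
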
> Now I have created a final patch [9] that makes re.split() split on
> zero-width patterns.
>
> >>> re.split(r'\b', 'Self-Defence Class')
> ['', 'Self', '-', 'Defence', ' ', 'Class', '']
> >>> re.split(r'\s+|(?<=-)', 'Self-Defence Class')
> ['Self-', 'Defence', 'Class']
> >>> re.split(r'\s*', 'Self-Defence Class')
> ['', 'S', 'e', 'l', 'f', '-', 'D', 'e', 'f', 'e', 'n', 'c', 'e', 'C', 'l',
> 'a', 's', 's', '']
>
> In the last case the result differs greatly from the previous result,
> and this is likely not what the author wanted to get. But users had two
> Python releases to fix their code. FutureWarning is not silent by default.
>
> Because these patterns produced errors or warnings in the two most recent
> releases, we don't need an additional parameter for compatibility.
>
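
For code that cannot rely on the patched re.split(), the first two
zero-width results quoted above can be approximated with re.finditer(). This
is only a sketch: split_on_matches is a hypothetical helper name, not part of
the re module or of the patch, and it ignores capturing groups and maxsplit.

```python
import re

def split_on_matches(pattern, text):
    """Split text at every match of pattern, including empty matches.

    Hypothetical helper for illustration only; it does not handle
    capturing groups or a maxsplit argument.
    """
    pieces = []
    last = 0
    for m in re.finditer(pattern, text):
        pieces.append(text[last:m.start()])  # text before this match
        last = m.end()                       # resume after the match
    pieces.append(text[last:])               # trailing remainder
    return pieces

text = 'Self-Defence Class'
at_boundaries = split_on_matches(r'\b', text)
at_space_or_hyphen = split_on_matches(r'\s+|(?<=-)', text)
print(at_boundaries)
print(at_space_or_hyphen)
```
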
> But the problem was not just with re.split(). Other functions also did not
> work well with patterns that can match the empty string.
>
> >>> re.findall(r'^|\w+', 'Self-Defence Class')
> ['', 'elf', 'Defence', 'Class']
> >>> list(re.finditer(r'^|\w+', 'Self-Defence Class'))
> [<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1, 4),
> match='elf'>, <re.Match object; span=(5, 12), match='Defence'>, <re.Match
> object; span=(13, 18), match='Class'>]
> >>> re.sub(r'(^|\w+)', r'<\1>', 'Self-Defence Class')
> '<>S<elf>-<Defence> <Class>'
>
> After matching the empty string, the following character was skipped
> and not included in the next match. My patch fixes these functions
> too.
>
> >>> re.findall(r'^|\w+', 'Self-Defence Class')
> ['', 'Self', 'Defence', 'Class']
> >>> list(re.finditer(r'^|\w+', 'Self-Defence Class'))
> [<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(0, 4),
> match='Self'>, <re.Match object; span=(5, 12), match='Defence'>, <re.Match
> object; span=(13, 18), match='Class'>]
> >>> re.sub(r'(^|\w+)', r'<\1>', 'Self-Defence Class')
> '<><Self>-<Defence> <Class>'
>
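
A small check of the fixed matching behavior, assuming an interpreter where
this change is present (Python 3.7 or later). It records the span of each
match so that the non-empty match starting right after the empty one at
position 0 is visible.

```python
import re

text = 'Self-Defence Class'
pattern = r'^|\w+'

# Collect (span, matched text) pairs. With the fix, the empty match at
# (0, 0) is followed by a non-empty match that also starts at 0.
matches = [(m.span(), m.group()) for m in re.finditer(pattern, text)]
words = re.findall(pattern, text)

for span, matched in matches:
    print(span, repr(matched))
```
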
> I think this change doesn't need preliminary warnings, because it changes
> the behavior of more rarely used patterns. No re tests have been broken. I
> needed to add new tests to detect the behavior change.
>
> But there is one spoonful of tar in a barrel of honey. I didn't expect
> this, but this change has broken a pattern used with re.sub() in the
> doctest module: r'(?m)^\s*?$'. This was fixed by replacing it with
> r'(?m)^[^\S\n]+?$'. I hope that such cases are pretty rare and think this
> is an avoidable breakage.
>
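
To illustrate why the replacement pattern is insensitive to the change: it
needs at least one non-newline whitespace character, so it can never match
the empty string. The sample text below is made up for this sketch and is
not taken from doctest.

```python
import re

# Hypothetical sample: one line of spaces only and one truly empty line.
sample = 'a\n   \nb\n\nc'

# [^\S\n] means "whitespace other than a newline"; the + quantifier
# rules out empty matches, so only the whitespace-only line is affected.
cleaned = re.sub(r'(?m)^[^\S\n]+?$', '', sample)
print(repr(cleaned))
```
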
> The new behavior of re.split() matches the behavior of regex.split() with
> the VERSION1 flag, and the new behavior of re.findall() and re.finditer()
> matches the behavior of the corresponding functions in the regex module
> (independently of the version flag). But the new behavior of re.sub()
> doesn't exactly match the behavior of regex.sub() with any version flag. It
> differs from the old behavior, as you can see in the example above, but is
> closer to it than to regex.sub() with VERSION1. This avoided breaking
> existing tests for re.sub().
>
> >>> regex.sub(r'(\W+|(?<=-))', r':', 'Self-Defence Class')
> 'Self:Defence:Class'
> >>> regex.sub(r'(?V1)(\W+|(?<=-))', r':', 'Self-Defence Class')
> 'Self::Defence:Class'
> >>> re.sub(r'(\W+|(?<=-))', r':', 'Self-Defence Class')
> 'Self:Defence:Class'
>
> Like re.split(), it never matches the empty string adjacent to the previous
> match. re.findall() and re.finditer() only refuse to match the empty string
> adjacent to a previous empty string match. In the regex module
> regex.sub() is mutually consistent with regex.findall() and
> regex.finditer() (with the VERSION1 flag), but regex.split() is not
> consistent with them. In the re module re.split() and re.sub() will be
> mutually consistent, as will re.findall() and re.finditer(). This is
> more backward compatible. And I don't know of any reason to prefer the
> behavior of re.findall() and re.finditer() over the behavior of re.split()
> in this corner case.
>
> It would be nice to get this change into 3.7.0a3 for wider testing. Please
> review the patch [9] or share your thoughts about this change.
>
> [1] https://docs.python.org/3/library/re.html
> [2] https://pypi.python.org/pypi/regex/
> [3] https://mail.python.org/pipermail/python-dev/2004-August/047272.html
> [4] https://bugs.python.org/issue852532
> [5] https://bugs.python.org/issue988761
> [6] https://bugs.python.org/issue1647489
> [7] https://bugs.python.org/issue3262
> [8] https://bugs.python.org/issue25054
> [9] https://github.com/python/cpython/pull/4471

-- 
--Guido van Rossum (python.org/~guido)