[Python-Dev] Regular expressions: splitting on zero-width patterns

MRAB python at mrabarnett.plus.com
Tue Nov 28 15:42:58 EST 2017


On 2017-11-28 20:04, Serhiy Storchaka wrote:
> The two largest problems in the re module are splitting on zero-width
> patterns and complete and correct support of the Unicode standard. These
> problems are solved in regex. regex has many other features, but they
> are less important.
> 
> I want to describe the problem of splitting on zero-width patterns. It
> was already discussed on Python-Dev 13 years ago [3], and maybe later.
> See also issues [4], [5], [6], [7], [8].
> 
> In short, it doesn't work. Splitting on the pattern r'\b' doesn't split
> the text at word boundaries, and splitting on the pattern
> r'\s+|(?<=-)' splits the text on whitespace but does not split
> words at hyphens as expected.
> 
> In Python 3.4 and earlier:
> 
>   >>> re.split(r'\b', 'Self-Defence Class')
> ['Self-Defence Class']
>   >>> re.split(r'\s+|(?<=-)', 'Self-Defence Class')
> ['Self-Defence', 'Class']
>   >>> re.split(r'\s*', 'Self-Defence Class')
> ['Self-Defence', 'Class']
> 
> Note that splitting on r'\s*' (0 or more whitespace characters) actually
> splits on r'\s+' (1 or more whitespace characters). Splitting on patterns
> that can only match the empty string (like r'\b' or r'(?<=-)') never worked, while
> splitting
> 
> Since Python 3.5, splitting on a pattern that can only match the
> empty string raises a ValueError (this never worked), and splitting on a
> pattern that can match the empty string, but not only it, emits a
> FutureWarning. This gave developers time to replace their patterns
> r'\s*' with r'\s+', as they should be.
> 
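A minimal sketch of the migration described above (the variable names are only illustrative): a pattern that is meant to split on runs of whitespace can say so explicitly with r'\s+'. Since r'\s+' cannot match the empty string, the result should not depend on how re.split() treats zero-width matches.

```python
import re

text = 'Self-Defence Class'

# r'\s+' requires at least one whitespace character, so it never
# produces a zero-width match and is unaffected by the change.
parts = re.split(r'\s+', text)
print(parts)  # ['Self-Defence', 'Class']
```
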
> Now I have created a final patch [9] that makes re.split() split on
> zero-width patterns.
> 
>   >>> re.split(r'\b', 'Self-Defence Class')
> ['', 'Self', '-', 'Defence', ' ', 'Class', '']
>   >>> re.split(r'\s+|(?<=-)', 'Self-Defence Class')
> ['Self-', 'Defence', 'Class']
>   >>> re.split(r'\s*', 'Self-Defence Class')
> ['', 'S', 'e', 'l', 'f', '-', 'D', 'e', 'f', 'e', 'n', 'c', 'e', 'C',
> 'l', 'a', 's', 's', '']
> 
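The quoted results can be reproduced with a short script. This is only a sketch and assumes Python 3.7 or later, where re.split() accepts zero-width matches; on 3.5 and 3.6 the first call raises ValueError instead. The variable names are illustrative.

```python
import re

text = 'Self-Defence Class'

# Assumption: Python 3.7+. r'\b' only ever matches the empty string,
# so every split point here is a zero-width match.
pieces = re.split(r'\b', text)
print(pieces)  # ['', 'Self', '-', 'Defence', ' ', 'Class', '']

# Split on whitespace, or on the empty string just after a hyphen.
after_hyphen = re.split(r'\s+|(?<=-)', text)
print(after_hyphen)  # ['Self-', 'Defence', 'Class']

# The leading and trailing '' come from the boundaries at both ends of
# the text; they can be filtered out if they are not wanted.
non_empty = [p for p in pieces if p]
```
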
> In the last case the result differs too much from the previous result,
> and this is likely not what the author wanted to get. But users had two
> Python releases to fix their code. FutureWarning is not silent by
> default.
> 
> Because these patterns produced errors or warnings in the recent two
> releases, we don't need an additional parameter for compatibility.
> 
> But the problem was not just with re.split(). Other functions also
> did not work well with patterns that can match the empty string.
> 
>   >>> re.findall(r'^|\w+', 'Self-Defence Class')
> ['', 'elf', 'Defence', 'Class']
>   >>> list(re.finditer(r'^|\w+', 'Self-Defence Class'))
> [<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(1,
> 4), match='elf'>, <re.Match object; span=(5, 12), match='Defence'>,
> <re.Match object; span=(13, 18), match='Class'>]
>   >>> re.sub(r'(^|\w+)', r'<\1>', 'Self-Defence Class')
> '<>S<elf>-<Defence> <Class>'
> 
> After matching the empty string, the following character is skipped
> and is not included in the next match. My patch fixes these
> functions too.
> 
>   >>> re.findall(r'^|\w+', 'Self-Defence Class')
> ['', 'Self', 'Defence', 'Class']
>   >>> list(re.finditer(r'^|\w+', 'Self-Defence Class'))
> [<re.Match object; span=(0, 0), match=''>, <re.Match object; span=(0,
> 4), match='Self'>, <re.Match object; span=(5, 12), match='Defence'>,
> <re.Match object; span=(13, 18), match='Class'>]
>   >>> re.sub(r'(^|\w+)', r'<\1>', 'Self-Defence Class')
> '<><Self>-<Defence> <Class>'
> 
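As a sketch of the fixed behaviour shown above, again assuming Python 3.7 or later (the variable names are illustrative): a non-empty match may now start at the position where an empty match was just found, so the 'S' is no longer skipped.

```python
import re

text = 'Self-Defence Class'

# Assumption: Python 3.7+. '^' matches the empty string at position 0,
# and r'\w+' is then allowed to match 'Self' starting at position 0 too.
found = re.findall(r'^|\w+', text)
print(found)  # ['', 'Self', 'Defence', 'Class']

spans = [m.span() for m in re.finditer(r'^|\w+', text)]
print(spans)  # [(0, 0), (0, 4), (5, 12), (13, 18)]

marked = re.sub(r'(^|\w+)', r'<\1>', text)
print(marked)  # <><Self>-<Defence> <Class>
```
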
> I think this change doesn't need preliminary warnings, because it changes
> the behavior of more rarely used patterns. No re tests have been broken.
> I needed to add new tests to detect the behavior change.
> 
> But there is one spoonful of tar in a barrel of honey. I didn't expect
> this, but this change has broken a pattern used with re.sub() in the
> doctest module: r'(?m)^\s*?$'. This was fixed by replacing it with
> r'(?m)^[^\S\n]+?$'. I hope that such cases are pretty rare and think
> this is an avoidable breakage.
> 
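To illustrate why the replacement pattern sidesteps the issue, here is a small sketch. The sample string and names are made up for illustration and are not taken from doctest itself. [^\S\n] matches any whitespace character except a newline, and the '+' requires at least one of them, so the pattern can never match the empty string.

```python
import re

# Hypothetical sample: two whitespace-only lines between two text lines.
got = 'first\n   \n\t\nlast'

# Multiline mode: '^' and '$' match at the start and end of each line.
blank_line = re.compile(r'(?m)^[^\S\n]+?$')

cleaned = blank_line.sub('', got)
print(repr(cleaned))  # 'first\n\n\nlast'
```

Only the spaces and the tab are removed; the newlines stay, because the character class excludes '\n'.
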
> The new behavior of re.split() matches the behavior of regex.split()
> with the VERSION1 flag, and the new behavior of re.findall() and
> re.finditer() matches the behavior of the corresponding functions in the
> regex module (independently of the version flag). But the new behavior
> of re.sub() doesn't exactly match the behavior of regex.sub() with any
> version flag. It differs from the old behavior, as you can see in the
> example above, but is closer to it than to regex.sub() with VERSION1.
> This made it possible to avoid breaking existing tests for re.sub().
> 
>   >>> regex.sub(r'(\W+|(?<=-))', r':', 'Self-Defence Class')
> 'Self:Defence:Class'
>   >>> regex.sub(r'(?V1)(\W+|(?<=-))', r':', 'Self-Defence Class')
> 'Self::Defence:Class'
>   >>> re.sub(r'(\W+|(?<=-))', r':', 'Self-Defence Class')
> 'Self:Defence:Class'
> 
> Like re.split(), it never matches the empty string adjacent to the previous
> match. re.findall() and re.finditer() only refuse to match the empty string
> adjacent to a previous empty-string match. In the regex module
> regex.sub() is mutually consistent with regex.findall() and
> regex.finditer() (with the VERSION1 flag), but regex.split() is not
> consistent with them. In the re module re.split() and re.sub() will be
> mutually consistent, as will re.findall() and re.finditer(). This is
> more backward compatible. And I don't know of any reasons for preferring
> the behavior of re.findall() and re.finditer() over the behavior of
> re.split() in this corner case.
> 
FTR, you could make an argument for either behaviour. For regex, I went 
with what Perl does.

> It would be nice to get this change into 3.7.0a3 for wider testing. Please
> review the patch [9] or share your thoughts about this change.
> 
> [1] https://docs.python.org/3/library/re.html
> [2] https://pypi.python.org/pypi/regex/
> [3] https://mail.python.org/pipermail/python-dev/2004-August/047272.html
> [4] https://bugs.python.org/issue852532
> [5] https://bugs.python.org/issue988761
> [6] https://bugs.python.org/issue1647489
> [7] https://bugs.python.org/issue3262
> [8] https://bugs.python.org/issue25054
> [9] https://github.com/python/cpython/pull/4471
> 

