[Python-Dev] Zero-width matching in regexes

Wed Dec 6 08:37:47 EST 2017

On 6 December 2017 at 13:13, Serhiy Storchaka <storchaka at gmail.com> wrote:
> 05.12.17 22:26, Terry Reedy wrote:
>>
>> On 12/4/2017 6:21 PM, MRAB wrote:
>>>
>>> I've finally come to a conclusion as to what the "correct" behaviour of
>>> zero-width matches should be: """always return the first match, but never a
>>> zero-width match that is joined to a previous zero-width match""".
>>
>>
>> Is this different from current re or regex?
>
>
> Partially. There are different ways of handling the problem of repeated
> zero-width searching.
>
> 1. The one formulated by Matthew. This is the behavior of findall() and
> finditer() in regex in both VERSION0 and VERSION1 modes, sub() in regex in
> the VERSION1 mode, and findall() and finditer() in re since 3.7.
>
> 2. Prohibit a zero-width match that is joined to a previous match
> (independent from its width). This is the behavior of sub() in re and in
> regex in the VERSION0 mode, and split() in regex in the VERSION1 mode. This
> is the only correctly documented and explicitly tested behavior in re.
>
> 3. Prohibit a zero-width match (always). This is the behavior of split() in
> re in 3.4 and older (deprecated since 3.5) and in regex in VERSION0 mode.
>
> 4. Exclude the character following a zero-width match from following
> matches. This is the behavior of findall() and finditer() in 3.6 and older.
>
> Case 4 is definitely incorrect: it excludes characters from matching.
> For example, re.findall(r'^|\w+', 'two words') returns ['', 'wo', 'words'].
>
> Case 3 is pretty useless. It disallows splitting on useful zero-width
> patterns like `\b` and makes `\s*` equivalent to `\s+`.
>
> The difference between cases 1 and 2 is subtle. Case 1 looks more
> logical and matches the behavior of Perl and PCRE, but case 2 is
> explicitly documented and tested. Case 2 is kept for compatibility
> with an ancient re implementation.
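
To make the complaints about cases 3 and 4 concrete, here is a small sketch
using the standard re module. It assumes Python 3.7 or later, where findall()
follows case 1 as listed above; that split() also accepts zero-width matches
from 3.7 on is taken from the re documentation rather than from this thread.

```python
import re

# Assumption: Python 3.7 or later. Older versions behave differently
# (case 4 for findall(), case 3 for split()).

# Under case 1 the empty match of '^' no longer swallows the 't' of 'two'.
found = re.findall(r'^|\w+', 'two words')
print(found)   # ['', 'two', 'words']

# Splitting on the zero-width pattern \b, which case 3 would forbid.
pieces = re.split(r'\b', 'two words')
print(pieces)  # ['', 'two', ' ', 'words', '']
```

On 3.6 and older the first call gives the ['', 'wo', 'words'] result quoted
above instead.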

Behaviour (1) means that we'd get

>>> regex.sub(r'\w*', 'x', 'hello world', flags=regex.VERSION1)
'xx xx'

(because \w* matches the empty string after each word, as well as each
word itself). I just tested in Perl, and that is indeed what happens
there as well.
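
A quick way to see where those empty matches sit is to list the match spans.
This is only an illustrative sketch with the standard re module; for this
particular pattern and string, finditer() should report the same spans in
older and newer Python versions.

```python
import re

# Each (start, end) pair is one match of \w* in the string.
# A pair with start == end is a zero-width match.
spans = [m.span() for m in re.finditer(r'\w*', 'hello world')]
print(spans)  # [(0, 5), (5, 5), (6, 11), (11, 11)]
```

The spans (5, 5) and (11, 11) are the empty matches directly after 'hello'
and 'world'. Replacing all four matches with 'x' is what yields 'xx xx'.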

On that basis, I have to say that I find behaviour (2) more intuitive
and (arguably) "correct":

>>> regex.sub(r'\w*', 'x', 'hello world', flags=regex.VERSION0)
'x x'
>>> re.sub(r'\w*', 'x', 'hello world')
'x x'
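
The difference between the two behaviours can be sketched in pure Python. The
helper below, sub_by_rule(), is hypothetical and is not how re or regex
implement sub(). It simply takes the match list from re.finditer() and, for
behaviour (2), drops any zero-width match that starts where the previous match
ended. That shortcut is enough for this example but is not a general
reimplementation of either engine.

```python
import re

def sub_by_rule(pattern, repl, text, rule):
    """Illustrative only: replace matches, filtering empty ones per `rule`."""
    out = []
    pos = 0          # end of the text already copied to the output
    prev_end = None  # where the previous match (kept or dropped) ended
    for m in re.finditer(pattern, text):
        start, end = m.span()
        # A zero-width match that touches the end of the previous match.
        joined_empty = (start == end) and (start == prev_end)
        prev_end = end
        if rule == 2 and joined_empty:
            continue  # behaviour (2): ignore it
        out.append(text[pos:start])  # copy the unmatched gap
        out.append(repl)
        pos = end
    out.append(text[pos:])
    return ''.join(out)

print(sub_by_rule(r'\w*', 'x', 'hello world', rule=1))  # xx xx
print(sub_by_rule(r'\w*', 'x', 'hello world', rule=2))  # x x
```

Under behaviour (1) nothing is dropped here, because neither empty match is
joined to another empty match; each one follows a non-empty word match.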

Paul