[Python-Dev] split('') revisited

Tim Peters tim.one@comcast.net
Thu, 01 Aug 2002 02:39:24 -0400

[Andrew Koenig]
> ...
> Section 4.2.4 of the library reference says that the 'split' method of a
> regular expression object is defined as
>         Identical to the split() function, using the compiled pattern.

Supplying words intended to be clear from context, it's saying that the
split method of a regexp object is identical to the re.split() function,
which is true.  In much the same way, list.pop() isn't the same thing as
eyeball.pop() <wink>.

> This claim does not appear to be correct:
>         >>> import re
>         >>> re.compile('').split('abcde')
>         ['abcde']
> This result differs from the result of using the string split method.

True, but it's the same as

>>> import re
>>> re.split('', 'abcde')
['abcde']

which is all the docs are trying to say.
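
A minimal check of that reading, assuming only the standard re module: the compiled-pattern method and the module-level function should agree on any given input, whatever the splitting rules happen to be. The variable names here are illustrative only.

```python
import re

pattern, text = 'a|(x?)', 'abracadabra'

# The method on a compiled pattern object ...
method_result = re.compile(pattern).split(text)
# ... and the module-level function given the same pattern.
function_result = re.split(pattern, text)

# The docs claim only that these two agree with each other.
same = method_result == function_result
print(same)
```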

> ...
> My first impulse was to argue that (4) is right, and that the behavior
> should be as follows
>         >>> 'abcde'.split('')
>         ['a', 'b', 'c', 'd', 'e']

If that's what you want, list('abcde') is a direct way to get it.
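
For instance, as a small sketch (the name chars is just for illustration):

```python
# list() iterates over the string and collects each character.
chars = list('abcde')
print(chars)  # ['a', 'b', 'c', 'd', 'e']
```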

> ...
> I made the counterargument that one could disambiguate by adding the
> rule that no element of the result could be equal to the delimiter.
> Therefore, if s is a string, s.split('') cannot contain any empty
> strings.

Sure, that's one arbitrary rule <wink>.  It doesn't seem to extend to
regexps in a reasonable way, though:

>>> re.split('.*', 'abcde')
['', '']

Both split pieces there match the pattern.
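
One way to check that claim without depending on how any particular re.split behaves: take the two pieces quoted above as given and test each against the pattern. Reading "equal to the delimiter" as "matched in full by the pattern" is an assumption about how the proposed rule would carry over to regexps.

```python
import re

pattern = re.compile('.*')
pieces = ['', '']  # the result quoted above, taken as given

# Under the assumed reading, a piece breaks the rule if the
# pattern matches the whole piece.
violations = [p for p in pieces if pattern.fullmatch(p) is not None]
print(violations)  # ['', '']
```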

> However, looking at the behavior of regular expression splitting more
> closely, I become more confused.  Can someone explain the following
> behavior to me?
>         >>> re.compile('a|(x?)').split('abracadabra')
>         ['', None, 'br', None, 'c', None, 'd', None, 'br', None, '']

From the docs:

    If capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting list.

It should also say that splits never occur at points where the only match is
against an empty string (indeed, that's exactly why re.split('', 'abcde')
doesn't split anywhere).  The logic is like:

    while True:
        find next non-empty match, else break
        emit the slice between this and the end of the last match
        emit all capturing groups
        advance position by length of match
    emit the slice from the end of the last match to the end of the string

It's the last line in the loop body that makes empty matches a wart if
allowed:  they wouldn't advance the position at all, and an infinite loop
would result.  In order to make them do what you think you want, we'd have
to add, at the end of the loop body

        ah, and if the match was empty, advance the position again, by,
        oh, i don't know, how about 1?  That's close to 0 <wink>.

So the pattern matches at the first 'a', and adds '' to the list (the slice
to the left of the first match) and None to the list (the capturing group
didn't participate in the match, but that doesn't excuse it from adding
something to the list).  There are no other non-empty matches until getting
to the second 'a', and then that adds 'br' to the list (the slice between
the current match and the last match), and None again for the
non-participating capturing group.  Etc.  The trailing empty string is the
slice from the end of the last match to the end of the string (which happens
to be empty in this case).
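
The loop and walkthrough above can be sketched as runnable Python. This is an approximation under stated assumptions, not the actual sre implementation: the function name split_skipping_empty is made up for illustration, and "find next non-empty match" is read here as "search from the current position, and if the match found is empty, step one character and search again". It reproduces the results quoted in this message; an re.split that does split on empty matches would return something different.

```python
import re

def split_skipping_empty(pattern, string):
    """Hypothetical sketch of the splitting loop described above."""
    regex = re.compile(pattern)
    pieces = []
    last_end = 0  # end of the last match that caused a split
    pos = 0       # where the next search starts
    while True:
        m = regex.search(string, pos)
        if m is None:
            break
        if m.start() == m.end():
            # Empty match: never split here.  Step past this position
            # so the search cannot find the same empty match forever.
            if m.start() >= len(string):
                break
            pos = m.start() + 1
            continue
        # Emit the slice between this match and the end of the last one.
        pieces.append(string[last_end:m.start()])
        # Emit all capturing groups (None if a group did not participate).
        pieces.extend(m.groups())
        # Advance the position by the length of the match.
        pos = last_end = m.end()
    # Emit the slice from the end of the last match to the end of the string.
    pieces.append(string[last_end:])
    return pieces

print(split_skipping_empty('a|(x?)', 'abracadabra'))
```

On the inputs used in this message, the sketch returns ['abcde'] for the empty pattern, ['', ''] for '.*', and the None-laden list shown earlier for 'a|(x?)'.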

It's unclear to me what you expected instead.  Perhaps this?

>>> re.split('a|(?:x?)', 'abracadabra')
['', 'br', 'c', 'd', 'br', '']
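
The difference between the two patterns comes down to whether a capturing group exists at all. A short sketch using only the standard re module (variable names are illustrative):

```python
import re

capturing = re.compile('a|(x?)')        # (x?) is a capturing group
non_capturing = re.compile('a|(?:x?)')  # (?:x?) groups without capturing

# Number of capturing groups in each pattern.
group_counts = (capturing.groups, non_capturing.groups)
print(group_counts)  # (1, 0)

# When the 'a' alternative matches, group 1 takes no part in the match,
# so it reports None; that is the None that ends up in the split result.
groups_on_a = capturing.match('a').groups()
print(groups_on_a)  # (None,)
```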