[Python-Dev] split('') revisited

Andrew Koenig ark@research.att.com
Wed, 31 Jul 2002 18:35:21 -0400 (EDT)


Back in February, there was a thread in comp.lang.python (and, I
think, also on Python-Dev) that asked whether the following behavior:

        >>> 'abcde'.split('')
        Traceback (most recent call last):
          File "<stdin>", line 1, in ?
        ValueError: empty separator

was a bug or a feature.  The prevailing opinion at the time seemed
to be that there was not a sensible, unique way of defining this
operation, so rejecting it was a feature.

That answer didn't bother me particularly at the time, but since then
I have learned a new fact (or perhaps an old fact that I didn't notice
at the time) that has changed my mind: Section 4.2.4 of the library
reference says that the 'split' method of a regular expression object
is defined as

        Identical to the split() function, using the compiled pattern.

This claim does not appear to be correct:

        >>> import re
        >>> re.compile('').split('abcde')
        ['abcde']

This result differs from the result of using the string split method.

In other words, the documentation doesn't match the actual behavior,
so the status quo is broken.

It seems to me that there are four reasonable courses of action:

   1) Do nothing -- the problem is too trivial to worry about.

   2) Change string split (and its documentation) to match regexp split.

   3) Change regexp split (and its documentation) to match string split.

   4) Change both string split and regexp split to do something else :-)

My first impulse was to argue that (4) is right, and that the behavior
should be as follows

        >>> 'abcde'.split('')
        ['a', 'b', 'c', 'd', 'e']
        >>> import re
        >>> re.compile('').split('abcde')
        ['a', 'b', 'c', 'd', 'e']

When this discussion came up last time, I think there was an objection
that s.split('') was ambiguous: What argument is there in favor of
'abcde'.split('') being ['a', 'b', 'c', 'd', 'e'] instead of, say,
['', 'a', 'b', 'c', 'd', 'e', ''] or, for that matter, ['', 'a', '',
'b', '', 'c', '', 'd', '', 'e', '']?

I made the counterargument that one could disambiguate by adding the
rule that no element of the result could be equal to the delimiter.
Therefore, if s is a string, s.split('') cannot contain any empty
strings.
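
As a rough sketch of that rule, and nothing more: split_allowing_empty
below is a hypothetical helper (not an existing or proposed library
function) that treats the empty separator specially and leaves every
other case to the existing string method. The list of candidates is
just the three results mentioned above.

```python
def split_allowing_empty(s, sep):
    """Hypothetical helper: str.split plus a rule for the empty separator."""
    if sep == '':
        # Rule: no element of the result may equal the delimiter, so the
        # result may not contain any empty strings. Splitting between
        # every pair of characters is the reading that satisfies this.
        return list(s)
    return s.split(sep)

# The three candidate results for 'abcde'.split('') discussed above.
candidates = [
    ['a', 'b', 'c', 'd', 'e'],
    ['', 'a', 'b', 'c', 'd', 'e', ''],
    ['', 'a', '', 'b', '', 'c', '', 'd', '', 'e', ''],
]

# Keep only the candidates in which no element equals the delimiter ''.
allowed = [c for c in candidates if '' not in c]

print(split_allowing_empty('abcde', ''))
print(allowed)
```

Of the three candidates, only the first survives the rule, which is
the sense in which the rule removes the ambiguity.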

However, the more closely I look at the behavior of regular expression
splitting, the more confused I become.  Can someone explain the
following behavior to me?

        >>> re.compile('a|(x?)').split('abracadabra')
        ['', None, 'br', None, 'c', None, 'd', None, 'br', None, '']
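
One narrower experiment, offered only as a sketch: the variant pattern
'a|(x)' below is my own illustration, chosen so that the capturing group
can never match the empty string. If it gives the same list as above,
that would suggest the None entries come from the group taking no part
in a match on 'a', rather than from anything to do with empty matches.

```python
import re

# Variant of the pattern above with a required 'x' instead of 'x?', so
# the capturing group cannot match the empty string. This variant is an
# illustration only.
pattern = re.compile('a|(x)')
pieces = pattern.split('abracadabra')

# The string contains no 'x', so every split happens at an 'a', where
# group 1 does not participate in the match. split() includes the text
# of each capturing group in the result, and a group that did not
# participate is reported as None.
print(pieces)
```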