[Python-Dev] split('') revisited

Andrew Koenig ark@research.att.com
Thu, 1 Aug 2002 09:14:04 -0400 (EDT)

>> Section 4.2.4 of the library reference says that the 'split' method of a
>> regular expression object is defined as
>> Identical to the split() function, using the compiled pattern.

Tim> Supplying words intended to be clear from context, it's saying that the
Tim> split method of a regexp object is identical to the re.split() function,
Tim> which is true.  In much the same way, list.pop() isn't the same thing as
Tim> eyeball.pop() <wink>.

Right.  I missed the fact that there's another split.  Sorry about that.

>> My first impulse was to argue that (4) is right, and that the behavior
>> should be as follows
>> >>> 'abcde'.split('')
>> ['a', 'b', 'c', 'd', 'e']

Tim> If that's what you want, list('abcde') is a direct way to get it.

True, but that doesn't explain why it is useful to have
'abcde'.split('') and re.split('', 'abcde') behave differently.

>> I made the counterargument that one could disambiguate by adding the
>> rule that no element of the result could be equal to the delimiter.
>> Therefore, if s is a string, s.split('') cannot contain any empty
>> strings.

Tim> Sure, that's one arbitrary rule <wink>.  It doesn't seem to extend to
Tim> regexps in a reasonable way, though:

Tim> >>> re.split('.*', 'abcde')
Tim> ['', '']

Tim> Both split pieces there match the pattern.

Yes, that's part of the source of my confusion.

>> However, looking at the behavior of regular expression splitting more
>> closely, I become more confused.  Can someone explain the following
>> behavior to me?

>> >>> re.compile('a|(x?)').split('abracadabra')
>> ['', None, 'br', None, 'c', None, 'd', None, 'br', None, '']

Tim> From the docs:

Tim>     If capturing parentheses are used in pattern, then the text of all
Tim>     groups in the pattern are also returned as part of the resulting list.

OK -- as I said, I had assumed that split() was referring to the other
split function, probably because both of them were offscreen at the time.

Tim> It should also say that splits never occur at points where the only match is
Tim> against an empty string (indeed, that's exactly why re.split('', 'abcde')
Tim> doesn't split anywhere).  The logic is like:

Tim>     while True:
Tim>         find next non-empty match, else break
Tim>         emit the slice between this and the end of the last match
Tim>         emit all capturing groups
Tim>         advance position by length of match
Tim>     emit the slice from the end of the last match to the end of the string
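
For concreteness, the loop quoted above can be sketched in Python roughly as
follows.  This is only a sketch under stated assumptions, not the actual re
implementation: the name split_skip_empty is made up for illustration, and
"find next non-empty match" is approximated by retrying the search one
character further along whenever the match found is empty.

```python
import re

def split_skip_empty(pattern, string):
    """Hypothetical helper: split so that empty matches never cause a split."""
    regex = re.compile(pattern)
    pieces = []
    last_end = 0  # end of the last accepted (non-empty) match
    pos = 0       # where the next search starts
    while pos <= len(string):
        m = regex.search(string, pos)
        if m is None:
            break
        if m.start() == m.end():
            # Assumption: an empty match is disregarded and the search
            # is retried one character later.
            pos = m.start() + 1
            continue
        pieces.append(string[last_end:m.start()])  # slice before the match
        pieces.extend(m.groups())                  # all capturing groups
        last_end = pos = m.end()                   # advance by length of match
    pieces.append(string[last_end:])               # remainder of the string
    return pieces

print(split_skip_empty('', 'abcde'))    # ['abcde']
print(split_skip_empty('.*', 'abcde'))  # ['', '']
print(split_skip_empty('a|(x?)', 'abracadabra'))
```

Under these assumptions the last call reproduces the list shown earlier,
with None standing for the group that did not take part in each match on 'a'.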

Tim> It's the last line in the loop body that makes empty matches a wart if
Tim> allowed:  they wouldn't advance the position at all, and an infinite loop
Tim> would result.  In order to make them do what you think you want, we'd have
Tim> to add, at the end of the loop body

Tim>         ah, and if the match was empty, advance the position again, by,
Tim>         oh, i don't know, how about 1?  That's close to 0 <wink>.

Indeed, that's an arbitrary rule -- just about as arbitrary as the one
that you abbreviated above, which should really be

	    find the next match, but if the match is empty, disregard it;
	    instead, find the next match with a length of at least,
	    oh, I don't know, how about 1?  That's close to 0 <wink>.

What I'm trying to do is come up with a useful example to convince myself
that one is better than the other.
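
One way to hunt for such an example is to run both rules side by side.  The
sketch below is an illustration under assumptions, not the re module's code:
split_sketch and its allow_empty flag are hypothetical names, and the
"advance by 1" rule is read as moving only the search position, so that no
character is dropped from the output.

```python
import re

def split_sketch(pattern, string, allow_empty):
    """Hypothetical helper comparing the two rules for empty matches."""
    regex = re.compile(pattern)
    pieces = []
    last_end = 0  # end of the last match that caused a split
    pos = 0       # where the next search starts
    while pos <= len(string):
        m = regex.search(string, pos)
        if m is None:
            break
        is_empty = m.start() == m.end()
        if is_empty and not allow_empty:
            # Rule 1: disregard the empty match and look further along.
            pos = m.start() + 1
            continue
        pieces.append(string[last_end:m.start()])
        pieces.extend(m.groups())
        last_end = m.end()
        # Rule 2: after an empty match, step the search position by 1
        # so the loop cannot stall at the same spot.
        pos = m.end() + 1 if is_empty else m.end()
    pieces.append(string[last_end:])
    return pieces

print(split_sketch('', 'abcde', allow_empty=False))  # ['abcde']
print(split_sketch('', 'abcde', allow_empty=True))   # ['', 'a', 'b', 'c', 'd', 'e', '']
```

Under this reading, allowing empty matches still does not give
['a', 'b', 'c', 'd', 'e']: the empty matches at the two ends of the string
each contribute an empty string to the result.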