[Python-Dev] split('') revisited
Andrew Koenig
ark@research.att.com
Thu, 1 Aug 2002 09:14:04 -0400 (EDT)
>> Section 4.2.4 of the library reference says that the 'split' method of a
>> regular expression object is defined as
>>
>> Identical to the split() function, using the compiled pattern.
Tim> Supplying words intended to be clear from context, it's saying that the
Tim> split method of a regexp object is identical to the re.split() function,
Tim> which is true. In much the same way, list.pop() isn't the same thing as
Tim> eyeball.pop() <wink>.
Right. I missed the fact that there's another split. Sorry about that.
>> My first impulse was to argue that (4) is right, and that the behavior
>> should be as follows
>>
>> >>> 'abcde'.split('')
>> ['a', 'b', 'c', 'd', 'e']
Tim> If that's what you want, list('abcde') is a direct way to get it.
True, but that doesn't explain why it is useful to have
'abcde'.split('') and re.split('', 'abcde') behave differently.
>> I made the counterargument that one could disambiguate by adding the
>> rule that no element of the result could be equal to the delimiter.
>> Therefore, if s is a string, s.split('') cannot contain any empty
>> strings.
Tim> Sure, that's one arbitrary rule <wink>. It doesn't seem to extend to
Tim> regexps in a reasonable way, though:
Tim> >>> re.split('.*', 'abcde')
Tim> ['', '']
Tim> Both split pieces there match the pattern.
Yes, that's part of the source of my confusion.
>> However, looking at the behavior of regular expression splitting more
>> closely, I become more confused. Can someone explain the following
>> behavior to me?
>> >>> re.compile('a|(x?)').split('abracadabra')
>> ['', None, 'br', None, 'c', None, 'd', None, 'br', None, '']
Tim> From the docs:
Tim> If capturing parentheses are used in pattern, then the text of all
Tim> groups in the pattern are also returned as part of the resulting list.
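A small illustration of the documented behaviour quoted above. The patterns and inputs are made up for the example, and they only ever match non-empty text, so the results do not depend on how empty matches are treated. A group that takes no part in a match is reported as None, which is where the None entries in the 'abracadabra' output come from.

```python
import re

# A capturing group around the delimiter puts the delimiter text in the result.
with_group = re.split('(-)', 'a-b-c')

# Without the group, only the pieces between delimiters are returned.
without_group = re.split('-', 'a-b-c')

# When the match goes through the 'a' alternative, group 1 does not
# participate, so None is emitted in its place.
unmatched_group = re.split('a|(x)', 'bab')

print(with_group)       # ['a', '-', 'b', '-', 'c']
print(without_group)    # ['a', 'b', 'c']
print(unmatched_group)  # ['b', None, 'b']
```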
OK -- as I said, I had assumed that split() was referring to the other
split function, probably because both of them were offscreen at the time.
Tim> It should also say that splits never occur at points where the only match is
Tim> against an empty string (indeed, that's exactly why re.split('', 'abcde')
Tim> doesn't split anywhere). The logic is like:
Tim>     while True:
Tim>         find next non-empty match, else break
Tim>         emit the slice between this and the end of the last match
Tim>         emit all capturing groups
Tim>         advance position by length of match
Tim>     emit the slice from the end of the last match to the end of the string
Tim> It's the last line in the loop body that makes empty matches a wart if
Tim> allowed: they wouldn't advance the position at all, and an infinite loop
Tim> would result. In order to make them do what you think you want, we'd have
Tim> to add, at the end of the loop body
Tim>         ah, and if the match was empty, advance the position again, by,
Tim>         oh, i don't know, how about 1? That's close to 0 <wink>.
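One possible reading of the amended loop, as a sketch only. The name split_allow_empty is hypothetical, and keeping the "end of the last match" separate from the search position is an assumption, since the pseudocode does not say which of the two the extra step moves. This is not the re module's own implementation.

```python
import re

def split_allow_empty(pattern, string):
    """Split on every match, empty ones included (hypothetical helper)."""
    regex = re.compile(pattern)
    pieces = []
    last_end = 0   # end of the previous match
    pos = 0        # where the next search starts
    while pos <= len(string):
        m = regex.search(string, pos)
        if m is None:
            break
        pieces.append(string[last_end:m.start()])  # slice before this match
        pieces.extend(m.groups())                  # all capturing groups
        last_end = m.end()
        pos = m.end()
        if m.start() == m.end():
            # Assumed reading of the extra step: an empty match moves the
            # search position on by 1 so the loop cannot stall.
            pos += 1
    pieces.append(string[last_end:])               # tail after the last match
    return pieces

print(split_allow_empty('', 'abcde'))   # ['', 'a', 'b', 'c', 'd', 'e', '']
print(split_allow_empty('-', 'a-b'))    # ['a', 'b']
```

Under this reading the result carries an empty string at each end, so it still differs from list('abcde'); the "no element may equal the delimiter" rule discussed earlier would be one way to drop them.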
Indeed, that's an arbitrary rule -- just about as arbitrary as the one
that you abbreviated above, which should really be
    find the next match, but if the match is empty, disregard it;
    instead, find the next match with a length of at least,
    oh, I don't know, how about 1? That's close to 0 <wink>.
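The rule as spelled out here can be sketched in Python and run against the examples from this thread. The name split_skip_empty is hypothetical, and retrying one character past a disregarded empty match is an assumption about how "find the next match" proceeds. It is a sketch of the described logic, not the re module's own code.

```python
import re

def split_skip_empty(pattern, string):
    """Split only on non-empty matches (hypothetical helper)."""
    regex = re.compile(pattern)
    pieces = []
    last_end = 0   # end of the last match that was used
    pos = 0        # where the next search starts
    while pos <= len(string):
        m = regex.search(string, pos)
        if m is None:
            break
        if m.start() == m.end():
            # Empty match: disregard it and look again one character later.
            pos = m.start() + 1
            continue
        pieces.append(string[last_end:m.start()])  # slice before this match
        pieces.extend(m.groups())                  # all capturing groups
        last_end = pos = m.end()                   # advance past the match
    pieces.append(string[last_end:])               # tail after the last match
    return pieces

print(split_skip_empty('', 'abcde'))     # ['abcde']
print(split_skip_empty('.*', 'abcde'))   # ['', '']
print(split_skip_empty('a|(x?)', 'abracadabra'))
# ['', None, 'br', None, 'c', None, 'd', None, 'br', None, '']
```

With empty matches disregarded, the sketch reproduces the three results quoted in this message, including the None entries from the group that never participates.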
What I'm trying to do is come up with a useful example to convince myself
that one is better than the other.