
Back in February, there was a thread in comp.lang.python (and, I think, also on Python-Dev) that asked whether the following behavior:

    >>> 'abcde'.split('')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    ValueError: empty separator

was a bug or a feature. The prevailing opinion at the time seemed to be that there was not a sensible, unique way of defining this operation, so rejecting it was a feature.

That answer didn't bother me particularly at the time, but since then I have learned a new fact (or perhaps an old fact that I didn't notice at the time) that has changed my mind: Section 4.2.4 of the library reference says that the 'split' method of a regular expression object is defined as

    Identical to the split() function, using the compiled pattern.

This claim does not appear to be correct:

    >>> import re
    >>> re.compile('').split('abcde')
    ['abcde']

This result differs from the result of using the string split method. In other words, the documentation doesn't match the actual behavior, so the status quo is broken.

It seems to me that there are four reasonable courses of action:

    1) Do nothing -- the problem is too trivial to worry about.
    2) Change string split (and its documentation) to match regexp split.
    3) Change regexp split (and its documentation) to match string split.
    4) Change both string split and regexp split to do something else :-)

My first impulse was to argue that (4) is right, and that the behavior should be as follows

    >>> 'abcde'.split('')
    ['a', 'b', 'c', 'd', 'e']
    >>> import re
    >>> re.compile('').split('abcde')
    ['a', 'b', 'c', 'd', 'e']

When this discussion came up last time, I think there was an objection that s.split('') was ambiguous: What argument is there in favor of 'abcde'.split('') being ['a', 'b', 'c', 'd', 'e'] instead of, say, ['', 'a', 'b', 'c', 'd', 'e', ''] or, for that matter, ['', 'a', '', 'b', '', 'c', '', 'd', '', 'e', '']?
I made the counterargument that one could disambiguate by adding the rule that no element of the result could be equal to the delimiter. Therefore, if s is a string, s.split('') cannot contain any empty strings.

However, looking at the behavior of regular expression splitting more closely, I become more confused. Can someone explain the following behavior to me?

    >>> re.compile('a|(x?)').split('abracadabra')
    ['', None, 'br', None, 'c', None, 'd', None, 'br', None, '']

"AK" == Andrew Koenig <ark@research.att.com> writes:
AK> Back in February, there was a thread in comp.lang.python (and,
AK> I think, also on Python-Dev) that asked whether the following
AK> behavior:

    >>> 'abcde'.split('')
    | Traceback (most recent call last):
    |   File "<stdin>", line 1, in ?
    | ValueError: empty separator

AK> was a bug or a feature.  The prevailing opinion at the time
AK> seemed to be that there was not a sensible, unique way of
AK> defining this operation, so rejecting it was a feature.

AK> That answer didn't bother me particularly at the time, but
AK> since then I have learned a new fact (or perhaps an old fact
AK> that I didn't notice at the time) that has changed my mind:
AK> Section 4.2.4 of the library reference says that the 'split'
AK> method of a regular expression object is defined as

AK>     Identical to the split() function, using the compiled
AK>     pattern.

AK> This claim does not appear to be correct:

Actually, I believe what it's saying is that

    re.compile('').split('abcde')

is the same as

    re.split('', 'abcde')

not that re...split() has anything to do with the split() string method.

-Barry

Andrew Koenig <ark@research.att.com> writes:
It seems to me that there are four reasonable courses of action:
1) Do nothing -- the problem is too trivial to worry about.
2) Change string split (and its documentation) to match regexp split.
3) Change regexp split (and its documentation) to match string split.
4) Change both string split and regexp split to do something else :-)
There is another option:

5) Change the documentation of re.split to match the implemented behaviour.

Not that I could say what the implemented behaviour is, though :-(

Regards,
Martin

[Andrew Koenig]
... Section 4.2.4 of the library reference says that the 'split' method of a regular expression object is defined as
Identical to the split() function, using the compiled pattern.
Supplying words intended to be clear from context, it's saying that the split method of a regexp object is identical to the re.split() function, which is true. In much the same way, list.pop() isn't the same thing as eyeball.pop() <wink>.
This claim does not appear to be correct:

    >>> import re
    >>> re.compile('').split('abcde')
    ['abcde']
This result differs from the result of using the string split method.
True, but it's the same as
    >>> import re
    >>> re.split('', 'abcde')
    ['abcde']
which is all the docs are trying to say.
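To make that reading concrete, here is a minimal sketch comparing the method on a compiled pattern with the module-level function. The pattern `[,;]` and the sample string are assumptions chosen for illustration. A separator that cannot match the empty string is used on purpose, because the result of splitting on a pattern that can match an empty string depends on the Python version (re.split began splitting on empty matches in Python 3.7).

```python
import re

# Hypothetical separator pattern and input, chosen only for illustration.
pattern = re.compile(r'[,;]')
text = 'a,b;c'

# The split method of the compiled pattern object...
via_method = pattern.split(text)
# ...and the module-level re.split() called with the same pattern.
via_function = re.split(r'[,;]', text)

print(via_method)                  # ['a', 'b', 'c']
print(via_method == via_function)  # True
```

Neither call involves the string method str.split, which is a separate piece of machinery.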
... My first impulse was to argue that (4) is right, and that the behavior should be as follows

    >>> 'abcde'.split('')
    ['a', 'b', 'c', 'd', 'e']
If that's what you want, list('abcde') is a direct way to get it.
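A short sketch of that suggestion, assuming nothing beyond the built-in list() and the str.split method. The helper name split_chars is hypothetical and exists only to give the idea a label.

```python
def split_chars(s):
    """Return the characters of s as a list (hypothetical helper)."""
    return list(s)

print(split_chars('abcde'))  # ['a', 'b', 'c', 'd', 'e']

# The string method itself rejects an empty separator.
try:
    'abcde'.split('')
except ValueError as exc:
    print('ValueError:', exc)  # ValueError: empty separator
```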
... I made the counterargument that one could disambiguate by adding the rule that no element of the result could be equal to the delimiter. Therefore, if s is a string, s.split('') cannot contain any empty strings.
Sure, that's one arbitrary rule <wink>. It doesn't seem to extend to regexps in a reasonable way, though:
    >>> re.split('.*', 'abcde')
    ['', '']
Both split pieces there match the pattern.
However, looking at the behavior of regular expression splitting more closely, I become more confused. Can someone explain the following behavior to me?

    >>> re.compile('a|(x?)').split('abracadabra')
    ['', None, 'br', None, 'c', None, 'd', None, 'br', None, '']

Section 4.2.4 of the library reference says that the 'split' method of a regular expression object is defined as
Identical to the split() function, using the compiled pattern.
Tim> Supplying words intended to be clear from context, it's saying that the
Tim> split method of a regexp object is identical to the re.split() function,
Tim> which is true.  In much the same way, list.pop() isn't the same thing as
Tim> eyeball.pop() <wink>.

Right.  I missed the fact that there's another split.  Sorry about that.
My first impulse was to argue that (4) is right, and that the behavior should be as follows
    >>> 'abcde'.split('')
    ['a', 'b', 'c', 'd', 'e']

Tim> If that's what you want, list('abcde') is a direct way to get it.

True, but that doesn't explain why it is useful to have 'abcde'.split('') and re.split('', 'abcde') behave differently.
I made the counterargument that one could disambiguate by adding the rule that no element of the result could be equal to the delimiter. Therefore, if s is a string, s.split('') cannot contain any empty strings.
Tim> Sure, that's one arbitrary rule <wink>.  It doesn't seem to extend to
Tim> regexps in a reasonable way, though:

Tim>     >>> re.split('.*', 'abcde')
Tim>     ['', '']

Tim> Both split pieces there match the pattern.

Yes, that's part of the source of my confusion.
However, looking at the behavior of regular expression splitting more closely, I become more confused. Can someone explain the following behavior to me?
    >>> re.compile('a|(x?)').split('abracadabra')
    ['', None, 'br', None, 'c', None, 'd', None, 'br', None, '']
From the docs:
Tim> If capturing parentheses are used in pattern, then the text of all
Tim> groups in the pattern are also returned as part of the resulting list.

OK -- as I said, I had assumed that split() was referring to the other split function, probably because both of them were offscreen at the time.

Tim> It should also say that splits never occur at points where the only match is
Tim> against an empty string (indeed, that's exactly why re.split('', 'abcde')
Tim> doesn't split anywhere).  The logic is like:

Tim>     while True:
Tim>         find next non-empty match, else break
Tim>         emit the slice between this and the end of the last match
Tim>         emit all capturing groups
Tim>         advance position by length of match
Tim>     emit the slice from the end of the last match to the end of the string

Tim> It's the last line in the loop body that makes empty matches a wart if
Tim> allowed:  they wouldn't advance the position at all, and an infinite loop
Tim> would result.  In order to make them do what you think you want, we'd have
Tim> to add, at the end of the loop body

Tim>         ah, and if the match was empty, advance the position again, by,
Tim>         oh, i don't know, how about 1?  That's close to 0 <wink>.

Indeed, that's an arbitrary rule -- just about as arbitrary as the one that you abbreviated above, which should really be

    find the next match, but if the match is empty, disregard it; instead, find the next match with a length of at least, oh, I don't know, how about 1?  That's close to 0 <wink>.

What I'm trying to do is come up with a useful example to convince myself that one is better than the other.
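The loop Tim describes can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the re module's own code: the function name split_skipping_empty is hypothetical, and using finditer() while discarding empty matches is only one way to realize the "find next non-empty match" step.

```python
import re

def split_skipping_empty(pattern, string):
    """Split string on the non-empty matches of a compiled pattern.

    A sketch of the loop quoted above, not the re module's implementation.
    """
    pieces = []
    last_end = 0
    for match in pattern.finditer(string):
        if match.start() == match.end():
            continue  # empty match: never split here
        # Emit the slice between the end of the last match and this match.
        pieces.append(string[last_end:match.start()])
        # Emit all capturing groups (None for groups that took no part).
        pieces.extend(match.groups())
        # Advance the position past this match.
        last_end = match.end()
    # Emit the slice from the end of the last match to the end of the string.
    pieces.append(string[last_end:])
    return pieces

print(split_skipping_empty(re.compile(''), 'abcde'))
# ['abcde']
print(split_skipping_empty(re.compile('a|(x?)'), 'abracadabra'))
# ['', None, 'br', None, 'c', None, 'd', None, 'br', None, '']
```

In the second call every split happens at an 'a', where the group (x?) took no part in the match, which is where the None entries come from.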

... [Tim]
It's the last line in the loop body that makes empty matches a wart if allowed: they wouldn't advance the position at all, and an infinite loop would result. In order to make them do what you think you want, we'd have to add, at the end of the loop body
ah, and if the match was empty, advance the position again, by, oh, i don't know, how about 1? That's close to 0 <wink>.
[Andrew Koenig]
Indeed, that's an arbitrary rule -- just about as arbitrary as the one that you abbreviated above, which should really be
find the next match, but if the match is empty, disregard it; instead, find the next match with a length of at least, oh, I don't know, how about 1? That's close to 0 <wink>.
You really think so? I expect almost all programmers would understand what "find next non-empty match" means at first glance -- and especially regexp-slingers, who are often burned in their matching lives by the consequences of having large pieces of their patterns unexpectedly match an empty string. That makes "non-empty match" seem a natural concept to me.
What I'm trying to do is come up with a useful example to convince myself that one is better than the other.
Have you found one yet? I confess that re.findall() implements an "if the match was empty, advance the position by 1" rule, as in

    >>> re.findall("x?", "abc")
    ['', '', '', '']
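A minimal sketch of such an "advance by 1 after an empty match" rule is shown below. The function name findall_advancing is hypothetical, and the loop is an assumption about how the rule could be written, not the actual implementation of re.findall(); it is only checked against the one example above and may differ from re.findall() on other patterns.

```python
import re

def findall_advancing(pattern, string):
    """Collect matches, stepping one character past each empty match."""
    found = []
    pos = 0
    while pos <= len(string):
        match = pattern.search(string, pos)
        if match is None:
            break
        found.append(match.group())
        if match.end() > match.start():
            pos = match.end()      # non-empty match: continue right after it
        else:
            pos = match.end() + 1  # empty match: force progress by 1
    return found

print(findall_advancing(re.compile('x?'), 'abc'))  # ['', '', '', '']
print(re.findall('x?', 'abc'))                     # ['', '', '', '']
```

Without the extra step in the else branch, the search would keep finding the same empty match at the same position and the loop would never end.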
But I don't think we're doing anyone a favor with stuff like that. I think it's a dubious idea that

    >>> "abc".find('')
    0
"works" too. If a program does s1.find(s2) and s2 is an empty string, I expect the chances are good it's a logic error in the program. Analogies to, e.g., i+j when j happens to be 0 leave me cold, since I can think of a thousand reasons for why j might naturally be 0. But I've had a hard time thinking of a reasonable algorithm where the expression s1.find(s2) could be expected to have s2=="" in normal operation (and am sure it would have been a logic error elsewhere in any uses of string.find() I've made; ditto searching for, or splitting on, empty strings via regexps).