Hello all,
What if str.split could take an empty separator?
>>> 'banana'.split('')
['b', 'a', 'n', 'a', 'n', 'a']
I know this can be done with:
>>> list('banana')
['b', 'a', 'n', 'a', 'n', 'a']
I think that, semantically speaking, it would make sense to split where there are no characters (in between them). Right now you can join from an empty string:
''.join(['b', 'a', 'n', 'a', 'n', 'a'])
So why can't we split from an empty string?
This wouldn't introduce any backwards-incompatible changes, as str.split currently can't have an empty separator:
>>> 'banana'.split('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: empty separator
I would love to see my banana actually split. :)
Regards,
--
Alex
twitter.com/alexconrad
Alexandre Conrad wrote:
Hello all,
What if str.split could take an empty separator?
>>> 'banana'.split('')
['b', 'a', 'n', 'a', 'n', 'a']
I know this can be done with:
>>> list('banana')
['b', 'a', 'n', 'a', 'n', 'a']
I think that, semantically speaking, it would make sense to split where there are no characters (in between them). Right now you can join from an empty string:
''.join(['b', 'a', 'n', 'a', 'n', 'a'])
So why can't we split from an empty string?
This wouldn't introduce any backwards incompatible changes as str.split currently can't have an empty separator:
>>> 'banana'.split('')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: empty separator
I would love to see my banana actually split. :)
Shouldn't it be this:
>>> 'banana'.split('')
['', 'b', 'a', 'n', 'a', 'n', 'a', '']
After all, the separator does exist at the start and end of the string:
>>> 'banana'.startswith('')
True
>>> 'banana'.endswith('')
True
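To make the two readings proposed so far concrete, here is a minimal sketch. The helper names split_between and split_with_edges are hypothetical (neither exists in the standard library); they only reproduce the two results suggested in this thread and are not a proposed implementation.

```python
def split_between(s):
    """Hypothetical reading 1: split only between characters, like list(s)."""
    return list(s)


def split_with_edges(s):
    """Hypothetical reading 2: the empty separator also matches at the
    start and end of the string, so empty edge fields appear."""
    return [""] + list(s) + [""]


word = "banana"
print(split_between(word))
print(split_with_edges(word))

# Both readings round-trip through ''.join, so join alone cannot pick one.
assert "".join(split_between(word)) == word
assert "".join(split_with_edges(word)) == word

# The real method still rejects an empty separator.
try:
    word.split("")
except ValueError as exc:
    print("ValueError:", exc)
```

The second reading mirrors what a non-empty separator does at the edges: 'xbananax'.split('x') returns ['', 'banana', ''].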
Alexandre Conrad wrote:
What if str.split could take an empty separator?
Do you have a use case for this?
Right now you can join from an empty string... So why can't we split from an empty string?
Because splitting on an empty string is ambiguous, and nobody has so far put forward a compelling use case that would show how the ambiguity should best be resolved.
-- Greg
On Jul 29, 2010, at 5:33 PM, Greg Ewing wrote:
Alexandre Conrad wrote:
What if str.split could take an empty separator?
I propose that the semantics of str.split() never be changed.
It has been around for a long time and has a complex set of behaviors that people have come to rely on. For years, we've answered arcane questions about it and have made multiple revisions to the docs in a never-ending quest to precisely describe exactly what it does without just showing the underlying C code. Accordingly, existing uses depend mainly on what-it-does-as-implemented and less on the various ways it has been documented over the years.
Almost any change to str.split() would either complexify the explanation of what it does or would change the behavior in a way that would break somebody's code (perhaps in subtle ways that are hard to detect).
In my opinion, str.split() should never be touched again. Instead, it may be worthwhile to develop new splitters with precise semantics aimed at specific use cases.
Raymond
Raymond Hettinger wrote:
On Jul 29, 2010, at 5:33 PM, Greg Ewing wrote:
Alexandre Conrad wrote:
What if str.split could take an empty separator?
I propose that the semantics of str.split() never be changed.
It has been around for a long time and has a complex set of behaviors that people have come to rely on. For years, we've answered arcane questions about it and have made multiple revisions to the docs in a never-ending quest to precisely describe exactly what it does without just showing the underlying C code. Accordingly, existing uses depend mainly on what-it-does-as-implemented and less on the various ways it has been documented over the years.
Almost any change to str.split() would either complexify the explanation of what it does or would change the behavior in a way that would break somebody's code (perhaps in subtle ways that are hard to detect).
In my opinion, str.split() should never be touched again. Instead, it may be worthwhile to develop new splitters with precise semantics aimed at specific use cases.
Does it really have a complex set of behaviours? The only (possibly) surprising behaviour for me is when it splits on whitespace (i.e., passing it None as the separator). I find it very easy to understand. Or perhaps I'm just smarter than I thought! :-)
On 30/07/10 14:41, MRAB wrote:
Does it really have a complex set of behaviours?
I think Raymond may be referring to the fact that the behaviour of split() with and without a splitting string differs in subtle ways with certain edge cases. It's almost better thought of as two different functions that happen to share a name.
-- Greg
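A short illustration of the "two different functions" point, using only the built-in str.split. The sample string is an arbitrary choice for this sketch.

```python
text = "  a  b  "

# No separator (sep=None): runs of whitespace act as a single separator,
# and leading/trailing whitespace produces no empty fields.
no_sep = text.split()
print(no_sep)            # ['a', 'b']

# Explicit separator: every occurrence splits, so empty fields appear.
with_sep = text.split(" ")
print(with_sep)          # ['', '', 'a', '', 'b', '', '']

# The empty string shows the difference most sharply.
empty_no_sep = "".split()
empty_with_sep = "".split(",")
print(empty_no_sep)      # []
print(empty_with_sep)    # ['']
```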
On Jul 29, 2010, at 7:41 PM, MRAB wrote:
Raymond Hettinger wrote:
On Jul 29, 2010, at 5:33 PM, Greg Ewing wrote:
Alexandre Conrad wrote:
What if str.split could take an empty separator?
I propose that the semantics of str.split() never be changed.
It has been around for a long time and has a complex set of behaviors that people have come to rely on. For years, we've answered arcane questions about it and have made multiple revisions to the docs in a never-ending quest to precisely describe exactly what it does without just showing the underlying C code. Accordingly, existing uses depend mainly on what-it-does-as-implemented and less on the various ways it has been documented over the years.
Almost any change to str.split() would either complexify the explanation of what it does or would change the behavior in a way that would break somebody's code (perhaps in subtle ways that are hard to detect).
In my opinion, str.split() should never be touched again. Instead, it may be worthwhile to develop new splitters with precise semantics aimed at specific use cases.
Does it really have a complex set of behaviours? The only (possibly) surprising behaviour for me is when it splits on whitespace (i.e., passing it None as the separator). I find it very easy to understand. Or perhaps I'm just smarter than I thought! :-)
Past bug reports and newsgroup discussions covered a variety of misunderstandings:
* completely different algorithm when separator is None
* behavior when separator is multiple characters (i.e. set of possible splitters vs an aggregate splitter either with or without overlaps)
* behavior when maxsplit is zero
* behavior when string begins or ends with whitespace
* which characters count as whitespace
* behavior when a string begins or ends with a split character
* when runs of splitters are treated as a single splitter
* behavior of a zero-length splitter
* conditions under which x.join(s.split(x)) roundtrips
* algorithmic difference from re.split()
* are there invariants between s.count(x) and len(s.split(x)) so that you can correctly predict the number of fields returned
It was common that people thought str.split() was easy to understand until a corner case arose that defied their expectations. When the experts chimed in, it became clear that almost no one in those discussions had a clear understanding of exactly what the implemented behaviors were, and it was common to resort to experiment to disprove various incorrect hypotheses.
We revised the docs several times and added a number of examples and now have a pretty good description that took years to get right. Even now, it might be a good idea to validate the docs by seeing if someone can use the documentation text to write a pure Python version of str.split() that behaves exactly like the real thing (including all corner cases).
Even if you find all of the above to be easy and intuitive, I still think it wise that we not add to the complexity of str.split() with new or altered behaviors.
Raymond
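Several of the corner cases listed above can be checked by experiment. The following is a minimal sketch that assumes Python 3 str semantics. The sample strings are arbitrary, and the checks are illustrative rather than exhaustive.

```python
import re

# maxsplit=0 performs no splits, but sep=None still drops leading whitespace.
assert "  a b  ".split(None, 0) == ["a b  "]
assert "  a b  ".split(" ", 0) == ["  a b  "]

# A multi-character separator is matched as one unit (an aggregate
# splitter, not a set of characters), and matches do not overlap.
assert "a,b;c".split(",;") == ["a,b;c"]
assert "aaaa".split("aa") == ["", "", ""]

# re.split treats its first argument as a pattern, so a character class
# gives the "set of possible splitters" behavior, and capture groups
# are returned as extra fields.
assert re.split(r"[,;]", "a,b;c") == ["a", "b", "c"]
assert re.split(r"(,)", "a,b") == ["a", ",", "b"]

# For a non-empty separator x and no maxsplit, the field count is
# s.count(x) + 1 and x.join(s.split(x)) reproduces s.
samples = [("a,b,,c", ","), (",", ","), ("", ","), ("aaaa", "aa")]
for s, x in samples:
    parts = s.split(x)
    assert len(parts) == s.count(x) + 1
    assert x.join(parts) == s

# With sep=None the round trip is lost, because whitespace runs collapse.
collapsed = " ".join(" a  b ".split())
assert collapsed == "a b"

# Whitespace here means str.isspace(), which includes non-ASCII
# characters such as the no-break space.
nbsp_fields = "a\u00a0b".split()
assert nbsp_fields == ["a", "b"]

print("all corner-case checks passed")
```

The count and round-trip checks apply only when an explicit, non-empty separator is given and maxsplit is left at its default.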
participants (4)
- Alexandre Conrad
- Greg Ewing
- MRAB
- Raymond Hettinger