[Python-ideas] str.split with empty separator

Fri Jul 30 06:51:24 CEST 2010

On Jul 29, 2010, at 7:41 PM, MRAB wrote:

> Raymond Hettinger wrote:
>> On Jul 29, 2010, at 5:33 PM, Greg Ewing wrote:
>>> Alexandre Conrad wrote:
>>> 
>>>> What if str.split could take an empty separator?
>> I propose that the semantics of str.split() never be changed.
>>
>> It has been around for a long time and has a complex set of
>> behaviors that people have come to rely on.  For years, we've
>> answered arcane questions about it and have made multiple
>> revisions to the docs in a never-ending quest to precisely
>> describe exactly what it does without just showing the
>> underlying C code.  Accordingly, existing uses depend mainly on
>> what-it-does-as-implemented and less on the various ways it has
>> been documented over the years.
>>
>> Almost any change to str.split() would either complexify the
>> explanation of what it does or would change the behavior in a
>> way that would break somebody's code (perhaps in subtle ways
>> that are hard to detect).
>>
>> In my opinion, str.split() should never be touched again.
>> Instead, it may be worthwhile to develop new splitters with
>> precise semantics aimed at specific use cases.
> Does it really have a complex set of behaviours? The only (possibly)
> surprising behaviour for me is when it splits on whitespace (ie, passing
> it None as the separator). I find it very easy to understand. Or perhaps
> I'm just smarter than I thought! :-)

Past bug reports and newsgroup discussions covered
a variety of misunderstandings:

* completely different algorithm when separator is None
* behavior when separator is multiple characters
  (i.e. set of possible splitters vs an aggregate splitter
   either with or without overlaps).
* behavior when maxsplit is zero
* behavior when string begins or ends with whitespace
* which characters count as whitespace
* behavior when a string begins or ends with a split character
* when runs of splitters are treated as a single splitter
* behavior of a zero-length splitter
* conditions under which  x.join(s.split(x)) roundtrips
* algorithmic difference from re.split()
* are there invariants between s.count(x) and len(s.split(x))
   so that you can correctly predict the number of fields returned 
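
A few of the corner cases listed above can be made concrete with a
short, self-checking sketch. It assumes a current Python 3 interpreter
and str (not bytes) inputs; the helper names roundtrips() and
count_invariant_holds() are made up for illustration and are not part
of any library.

```python
import re

# Whitespace mode (sep=None) vs. an explicit separator: runs of
# whitespace collapse and the ends are trimmed only in whitespace mode.
assert "a  b".split() == ["a", "b"]
assert "a  b".split(" ") == ["a", "", "b"]
assert " a b ".split() == ["a", "b"]
assert " a b ".split(" ") == ["", "a", "b", ""]
assert "".split() == []
assert "".split(",") == [""]

# Whitespace is more than the space character.
assert "a\tb\nc".split() == ["a", "b", "c"]
assert "a\u00a0b".split() == ["a", "b"]

# A multi-character separator is one aggregate splitter, not a set of
# characters, and matches are found left to right without overlap.
assert "a<>b<>c".split("<>") == ["a", "b", "c"]
assert "a<b>c".split("<>") == ["a<b>c"]
assert "aaa".split("aa") == ["", "a"]

# A string that begins or ends with the split character.
assert ",a,".split(",") == ["", "a", ""]

# maxsplit of zero: whitespace mode still strips leading whitespace.
assert "a,b".split(",", 0) == ["a,b"]
assert " a b ".split(None, 0) == ["a b "]

# A zero-length splitter is rejected.
try:
    "abc".split("")
except ValueError:
    pass
else:
    raise AssertionError("expected ValueError for an empty separator")

# re.split() follows different rules (patterns, capture groups).
assert re.split(r",+", "a,,b") == ["a", "b"]
assert re.split(r"(,)", "a,b") == ["a", ",", "b"]


def roundtrips(s, x):
    """True if joining the fields with x reproduces s (x non-empty)."""
    return x.join(s.split(x)) == s


def count_invariant_holds(s, x):
    """True if the number of fields is s.count(x) + 1 (x non-empty)."""
    return len(s.split(x)) == s.count(x) + 1


assert roundtrips("a,,b,", ",")
assert count_invariant_holds("aaa", "aa")
# Whitespace mode does not round-trip in general.
assert " ".join("a  b".split()) != "a  b"
```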

It was common that people thought str.split() was easy to understand
until a corner case arose that defied their expectations.  When
the experts chimed in, it became clear that almost no one in 
those discussions had a clear understanding of exactly what 
the implemented behaviors were and it was common to resort
to experiment to disprove various incorrect hypotheses.
We revised the docs several times and added a number of
examples and now have a pretty good description that took
years to get right.   

Even now, it might be a good idea to validate the docs by
seeing if someone can use the documentation text to write 
a pure Python version of str.split() that behaves exactly like the
real thing (including all corner cases).
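
As a rough starting point for that exercise, here is one possible
sketch. It is not the C implementation and has not been checked
against every corner case; it assumes Python 3 str inputs only, treats
str.isspace() as the definition of whitespace, and makes no attempt to
match the built-in's error messages. The names py_split() and
_split_whitespace() are made up for illustration.

```python
import sys


def _split_whitespace(s, maxsplit):
    """Whitespace mode: runs of whitespace separate the words."""
    parts = []
    i, n = 0, len(s)
    while maxsplit > 0:
        # Skip the run of whitespace in front of the next word.
        while i < n and s[i].isspace():
            i += 1
        if i == n:
            break
        # Collect the word itself.
        j = i
        while i < n and not s[i].isspace():
            i += 1
        parts.append(s[j:i])
        maxsplit -= 1
    # If the split limit was reached, skip leading whitespace and keep
    # the remainder (including any trailing whitespace) as one field.
    while i < n and s[i].isspace():
        i += 1
    if i < n:
        parts.append(s[i:])
    return parts


def py_split(s, sep=None, maxsplit=-1):
    """Sketch of str.split() for str inputs (hypothetical helper)."""
    if maxsplit < 0:
        maxsplit = sys.maxsize  # a negative limit means "no limit"
    if sep is None:
        return _split_whitespace(s, maxsplit)
    if sep == "":
        raise ValueError("empty separator")
    parts = []
    start = 0
    while maxsplit > 0:
        idx = s.find(sep, start)  # left to right, non-overlapping
        if idx < 0:
            break
        parts.append(s[start:idx])
        start = idx + len(sep)
        maxsplit -= 1
    parts.append(s[start:])  # the remainder is always one last field
    return parts
```

Comparing py_split(s, sep, n) with s.split(sep, n) over a table of
awkward inputs is a quick way to find where a description and the
implementation disagree.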

Even if you find all of the above to be easy and intuitive,
I still think it wise that we not add to the complexity of str.split()
with new or altered behaviors.

Raymond