On Jul 29, 2010, at 7:41 PM, MRAB wrote:
Raymond Hettinger wrote:
On Jul 29, 2010, at 5:33 PM, Greg Ewing wrote:
Alexandre Conrad wrote:
What if str.split could take an empty separator?
I propose that the semantics of str.split() never be changed. It has been around for a long time and has a complex set of behaviors that people have come to rely on. For years, we've answered arcane questions about it and have made multiple revisions to the docs in a never-ending quest to describe exactly what it does without just showing the underlying C code. Accordingly, existing uses depend mainly on what-it-does-as-implemented and less on the various ways it has been documented over the years. Almost any change to str.split() would either complicate the explanation of what it does or change the behavior in a way that would break somebody's code (perhaps in subtle ways that are hard to detect). In my opinion, str.split() should never be touched again. Instead, it may be worthwhile to develop new splitters with precise semantics aimed at specific use cases.
Does it really have a complex set of behaviours? The only (possibly) surprising behaviour for me is when it splits on whitespace (i.e., passing it None as the separator). I find it very easy to understand. Or perhaps I'm just smarter than I thought! :-)
Past bug reports and newsgroup discussions covered a variety of misunderstandings:
* completely different algorithm when separator is None
* behavior when separator is multiple characters (i.e. set of possible splitters vs an aggregate splitter either with or without overlaps)
* behavior when maxsplit is zero
* behavior when string begins or ends with whitespace
* which characters count as whitespace
* behavior when a string begins or ends with a split character
* when runs of splitters are treated as a single splitter
* behavior of a zero-length splitter
* conditions under which x.join(s.split(x)) roundtrips
* algorithmic difference from re.split()
* are there invariants between s.count(x) and len(s.split(x)) so that you can correctly predict the number of fields returned
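
Several of the items above can be checked directly in the interpreter. The snippet below is only an illustration, assuming Python 3 str semantics; the sample strings and the helper name empty_sep_raises are made up for the example and do not come from the bug reports themselves.

```python
import re

s = "  a  b "

# sep=None: runs of whitespace act as one splitter, and leading or
# trailing whitespace produces no empty fields.
assert s.split() == ["a", "b"]

# Explicit separator: every occurrence splits, so empty fields appear.
assert s.split(" ") == ["", "", "a", "", "b", ""]

# A multi-character separator is matched as a whole string, not as a
# set of characters, and matches are found left to right without overlap.
assert "a, b,c".split(", ") == ["a", "b,c"]
assert "aaa".split("aa") == ["", "a"]

# maxsplit=0 performs no splits, but with sep=None the leading
# whitespace is still dropped while trailing whitespace is kept.
assert "a b c".split(" ", 0) == ["a b c"]
assert " a b ".split(None, 0) == ["a b "]

# The empty string behaves differently in the two modes.
assert "".split() == []
assert "".split(",") == [""]


def empty_sep_raises():
    """Return True if a zero-length separator is rejected (hypothetical helper)."""
    try:
        "abc".split("")
    except ValueError:
        return True
    return False


assert empty_sep_raises()

# With an explicit, non-empty separator the join round trip holds and
# the field count is the separator count plus one.
t = "a,,b,"
assert ",".join(t.split(",")) == t
assert len(t.split(",")) == t.count(",") + 1

# With sep=None the round trip generally does not hold.
assert " ".join(s.split()) != s

# re.split() keeps the empty fields at the ends that split() drops.
assert re.split(r"\s+", " a b ") == ["", "a", "b", ""]
```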
It was common for people to think str.split() was easy to understand until a corner case arose that defied their expectations. When the experts chimed in, it became clear that almost no one in those discussions had a clear understanding of exactly what the implemented behaviors were, and it was common to resort to experiment to disprove various incorrect hypotheses. We revised the docs several times and added a number of examples, and now have a pretty good description that took years to get right.
Even now, it might be a good idea to validate the docs by seeing if someone can use the documentation text to write a pure python version of str.split() that behaves exactly like the real thing (including all corner cases).
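
As a rough idea of what such an exercise might look like, here is a minimal sketch of a pure Python splitter. It is not the CPython implementation, and it has only been compared against the cases in the usage note, so it should not be read as a proof that the docs are complete. The names py_split and _split_whitespace are hypothetical, and the sketch assumes that str.isspace() identifies the same whitespace characters that str.split() uses when the separator is None.

```python
import sys


def py_split(s, sep=None, maxsplit=-1):
    """Pure-Python sketch of str.split(); hypothetical, not CPython's code."""
    if maxsplit < 0:
        maxsplit = sys.maxsize  # a negative maxsplit means "no limit"
    if sep is None:
        return _split_whitespace(s, maxsplit)
    if not isinstance(sep, str):
        raise TypeError("sep must be str or None")
    if sep == "":
        raise ValueError("empty separator")
    parts = []
    start = 0
    while maxsplit > 0:
        idx = s.find(sep, start)  # leftmost, non-overlapping match
        if idx < 0:
            break
        parts.append(s[start:idx])
        start = idx + len(sep)
        maxsplit -= 1
    # The remainder is always a field, even when it is empty.
    parts.append(s[start:])
    return parts


def _split_whitespace(s, maxsplit):
    """Whitespace mode: runs collapse and empty fields are never produced."""
    parts = []
    i, n = 0, len(s)
    while maxsplit > 0:
        while i < n and s[i].isspace():  # skip a run of whitespace
            i += 1
        if i == n:
            break
        j = i
        while i < n and not s[i].isspace():  # scan one word
            i += 1
        parts.append(s[j:i])
        maxsplit -= 1
    # If text remains once maxsplit is exhausted, skip the whitespace in
    # front of it and keep the rest verbatim, trailing whitespace included.
    while i < n and s[i].isspace():
        i += 1
    if i < n:
        parts.append(s[i:])
    return parts
```

A quick way to exercise it is to compare against the built-in over a grid of inputs, for example checking py_split(s, sep, m) == s.split(sep, m) for strings such as "", " a b ", "a,,b," and "aaa", separators None, ",", " " and "aa", and maxsplit values from -1 to 5. Agreement on such a grid shows only that the sketch matches on those inputs.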
Even if you find all of the above to be easy and intuitive, I still think it wise that we not add to the complexity of str.split() with new or altered behaviors.