[issue15554] correct and clarify str.splitlines() documentation

New submission from Chris Jerdonek: The documentation for str.splitlines()-- http://docs.python.org/dev/library/stdtypes.html#str.splitlines includes a statement that is not quite correct: "Unlike split(), if the string ends with line boundary characters the returned list does not have an empty last element." For example,
>>> '\n'.splitlines()
['']
>>> '\n\n'.splitlines()
['', '']
>>> '\r\n'.splitlines()
['']
>>> '\n\r\n'.splitlines()
['', '']
>>> '\r'.splitlines()
['']
>>> 'a\n\n'.splitlines()
['a', '']
Also, the note about split() only applies when split() is passed a separator. For example--
>>> 'a\n'.split('\n')
['a', '']
>>> 'a\n'.split()
['a']
Finally, the function's behavior on the empty string is another difference worth mentioning that is not covered by the existing note.
I am attaching a patch that addresses these points. Notice also that the patch phrases it not as whether the list *has* an empty last element, but whether an *additional* last element should be added, which is the more important point.
----------
assignee: docs@python
components: Documentation
files: issue-splitlines-docs-1.patch
keywords: easy, patch
messages: 167394
nosy: cjerdonek, docs@python, jcea, pitrou
priority: normal
severity: normal
stage: patch review
status: open
title: correct and clarify str.splitlines() documentation
versions: Python 2.7, Python 3.2, Python 3.3
Added file: http://bugs.python.org/file26680/issue-splitlines-docs-1.patch
_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue15554>
_______________________________________

R. David Murray added the comment:
Sigh. ;) At this point in my Python programming I intuitively understand what splitlines does, but every time we try to explain it in detail it gets messier and messier. I wasn't really happy with the addition of that sentence about split in the first place.
I don't understand what your splitlines examples are trying to say, they all look clear to me based on the fact that we are splitting *lines*.
I don't find your proposed language in the patch to be clearer. The existing sentence describes the concrete behavior, while your version is sort-of describing (ascribing?) some syntax to the line separators ("does not delimit"). The problem is that there *is* a syntax here, that of universal-newline-delimited-text, but that is too big a topic to explain in the splitlines doc.
There's another issue for creating a central description of universal-newline parsing, perhaps this entry could link to that discussion (and that discussion could perhaps mention splitlines).
The split behavior without a specified separator might actually be a bug (if so, it is not a fixable one), but in any case you are right that that clarification should be added if the existing sentence is kept.
----------
nosy: +ncoghlan, r.david.murray
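To make the "universal-newline-delimited-text" remark concrete, here is a minimal sketch, assuming a current CPython 3 interpreter. The sample string and variable names are illustrative and do not come from the thread or the patch.

```python
# Illustrative sample mixing three common line endings: \n, \r\n and \r.
text = "one\ntwo\r\nthree\rfour"

# splitlines() treats each of these endings as a line boundary.
lines = text.splitlines()
print(lines)  # ['one', 'two', 'three', 'four']

# split('\n') only knows about '\n', so the '\r' characters stay in the pieces.
naive = text.split("\n")
print(naive)  # ['one', 'two\r', 'three\rfour']
```

This is only meant to show why the method is described in terms of lines rather than in terms of a single separator string.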

Chris Jerdonek added the comment:
> I wasn't really happy with the addition of that sentence about split in the first place.
I think the instinct to put that sentence in there is a good one. It is a key, perhaps subtle difference.
> I don't understand what your splitlines examples are trying to say, they all look clear to me based on the fact that we are splitting *lines*.
I perhaps included too many examples and so clouded my point. :) I just needed one. The examples were simply to show why the existing language is not correct. The current language says, "if the string ends with line boundary characters the returned list does not have an empty last element." However, the examples are of strings that do end with line boundary characters but that *do* have an empty last element. The point is that splitlines() does not count a terminal line break as an additional line, while split('\n') (for example) does. But this is different from whether the returned list *has* an empty last element, which is what the current language says. The returned list can have empty last elements because of line breaks at the end. It's just that the one at the *very* end doesn't count towards that -- unlike the case for split():
>>> 'a'.splitlines()
['a']
>>> 'a\n'.splitlines()
['a']
>>> 'a\n\n'.splitlines()
['a', '']
>>> 'a\n\n\n'.splitlines()
['a', '', '']
>>> 'a\n\n\n'.split('\n')  # counts terminal line break as an extra line
['a', '', '', '']
I'm open to improving the language. Maybe "does not count a terminal line break as an additional line" instead of the original "a terminal line break does not delimit an additional empty line"?
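The "terminal line break is not an additional line" rule can be sketched in terms of split('\n'). The helper below is hypothetical (its name is not from the patch or the standard library) and only matches splitlines() for text whose sole line ending is '\n'; other boundaries such as '\r' are deliberately ignored.

```python
def splitlines_via_split(s):
    """Hypothetical helper: mimic str.splitlines() for LF-only text."""
    parts = s.split("\n")
    # split('\n') counts a terminal line break as one extra (empty) line and
    # returns [''] for the empty string; drop that single trailing element.
    if parts[-1] == "":
        parts.pop()
    return parts


# Samples taken from, or in the spirit of, the examples in this thread.
samples = ["", "a", "a\n", "a\n\n", "a\n\n\n", "\n", "a\nb"]
results = [splitlines_via_split(s) for s in samples]
for s, r in zip(samples, results):
    print(repr(s), r, r == s.splitlines())
```

Only one trailing empty element is removed, which is why 'a\n\n' still yields an empty last element.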
> There's another issue for creating a central description of universal-newline parsing, perhaps this entry could link to that discussion (and that discussion could perhaps mention splitlines).
I created that issue (issue 15543), and a patch is in the works along the lines you suggest. ;)
> The split behavior without a specified separator might actually be a bug (if so, it is not a fixable one), but in any case you are right that that clarification should be added if the existing sentence is kept.
Perhaps, but at least split() documents the behavior. :)
"runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace."
(from http://docs.python.org/dev/library/stdtypes.html#str.split )
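The documented split() behavior quoted above can be seen in a small sketch, assuming CPython 3. The sample string is illustrative only.

```python
s = " a  b "  # leading, trailing and doubled spaces

# No separator: runs of whitespace collapse, and no empty strings appear
# at the start or end of the result.
no_sep = s.split()
print(no_sep)  # ['a', 'b']

# Explicit separator: every single space delimits, so empty strings appear.
with_sep = s.split(" ")
print(with_sep)  # ['', 'a', '', 'b', '']

# The empty string shows the same divide between the two algorithms.
empty_no_sep = "".split()
empty_with_sep = "".split(" ")
print(empty_no_sep, empty_with_sep)  # [] ['']
```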

Chris Jerdonek added the comment:
Attaching patch with simplified wording in response to R. David Murray's feedback. In particular, "a terminal line break does not delimit an additional empty line" -> "a terminal line break does not result in an extra line."
----------
Added file: http://bugs.python.org/file26702/issue-15554-2.patch

R. David Murray added the comment:
Ah, now I see what you are talking about. Yes, your revision in the comment is clearer; but, unless I read it wrong, in the patch it now sounds like you are saying that ''.splitlines() does not return the same result as ''.split() when in fact it does.
I would also prefer that the "differences" discussion come in the separate paragraph after the specification of the behavior of the function, rather than the way you have it split up in the patch. I would include the mention of the lack-of-extra-line as part of the differences discussion: as I said I think that behavior follows logically from the fact that the function is splitting lines and so doesn't belong in the basic function description.

Changes by Terry J. Reedy <tjreedy@udel.edu>:
----------
nosy: +terry.reedy

Chris Jerdonek added the comment:
> in the patch it now sounds like you are saying that ''.splitlines() does not return the same result as ''.split() when in fact it does.
The two differences occur only when split() is passed a separator. split() uses a different algorithm when no separator is specified. For example, for the empty string case:
>>> ''.splitlines()
[]
>>> ''.split()
[]
>>> ''.split('\n')
['']
That is why I used the phrase "Unlike split() when passed a separator" in the patch:
+ Unlike :meth:`~str.split` when passed a separator, this method returns
+ an empty list for the empty string, and a terminal line break does not
I will change the language in the patch to parallel split()'s documentation more closely, to emphasize and make this distinction clearer: "when passed a separator" -> "when a delimiter string *sep* is given".
> I would also prefer that the "differences" discussion come in the separate paragraph after the specification of the behavior of the function,
Good point. I agree with you. That occurred to me while drafting the patch, but I was hesitant to change the existing structure too much. In the updated patch I am attaching, I have also made that change. Thanks a lot for reviewing!
----------
Added file: http://bugs.python.org/file26707/issue-15554-3.patch

Changes by R. David Murray <rdmurray@bitdance.com>:
----------
Removed message: http://bugs.python.org/msg167557

R. David Murray added the comment:
Ah, I read too quickly before. But that expression "when a delimiter string *sep* is given" is hard to wrap one's head around in this context. I think the problem really is that 'split' has such radically different behavior when given an argument as opposed to when it isn't. I consider that a design flaw in split, and always have. So, I suppose we can't do any better here because of that.
Please move the keepends discussion back up into the initial paragraph, and then I think we'll be good to go.

Chris Jerdonek added the comment:
> I think the problem really is that 'split' has such radically different behavior when given an argument as opposed to when it isn't.
Yep, the split() documentation is much more involved because of that.
> Please move the keepends discussion back up into the initial paragraph, and then I think we'll be good to go.
Sounds good. Would you also like me to move the example before the paragraph about differences, or should I leave the example at the end? Mention of the example may flow better after the keepends discussion, because the example is more about keepends rather than about the differences with split(). But it can go either way.
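For readers unfamiliar with the keepends argument discussed here, a minimal sketch follows, assuming CPython 3. The sample string is chosen for illustration and is not quoted from the patch.

```python
text = "ab c\n\nde fg\rkl\r\n"

# Default: line breaks are stripped from the returned lines.
without_ends = text.splitlines()
print(without_ends)  # ['ab c', '', 'de fg', 'kl']

# keepends passed positionally as True: each line keeps its own line break.
with_ends = text.splitlines(True)
print(with_ends)  # ['ab c\n', '\n', 'de fg\r', 'kl\r\n']

# With line breaks kept, joining the pieces reproduces the original string.
round_trip = "".join(with_ends)
print(round_trip == text)  # True
```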

R. David Murray added the comment:
Good point. Difference paragraph after the example would be best, I think.

Chris Jerdonek added the comment:
Here you go. Thanks again.
----------
Added file: http://bugs.python.org/file26709/issue-15554-4.patch

Roundup Robot added the comment:
New changeset 768b188262e7 by R David Murray in branch '3.2': #15554: clarify splitlines/split differences. http://hg.python.org/cpython/rev/768b188262e7
New changeset 0d6eea2330d0 by R David Murray in branch 'default': Merge #15554: clarify splitlines/split differences. http://hg.python.org/cpython/rev/0d6eea2330d0
New changeset e057a7d18fa2 by R David Murray in branch '2.7': #15554: clarify splitlines/split differences. http://hg.python.org/cpython/rev/e057a7d18fa2
----------
nosy: +python-dev

R. David Murray added the comment:
Thanks for sticking with it.
----------
resolution: -> fixed
stage: patch review -> committed/rejected
status: open -> closed
participants (4)
- Chris Jerdonek
- R. David Murray
- Roundup Robot
- Terry J. Reedy