[ python-Bugs-1105286 ] Undocumented implicit strip() in
split(None) string method
SourceForge.net
noreply at sourceforge.net
Mon Jan 24 08:15:16 CET 2005
Bugs item #1105286, was opened at 2005-01-19 10:04
Message generated for change (Comment added) made by tjreedy
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1105286&group_id=5470
Category: Documentation
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: YoHell (yohell)
Assigned to: Raymond Hettinger (rhettinger)
Summary: Undocumented implicit strip() in split(None) string method
Initial Comment:
Hi!
I noticed that the string method split() first does an
implicit strip() before splitting when it's used with
no arguments or with None as the separator (sep in the
docs). There is no mention of this implicit strip() in
the docs.
Example 1:
s = " word1 word2 "
s.split() then returns ['word1', 'word2'] and not ['',
'word1', 'word2', ''] as one might expect.
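The difference is easy to check in an interpreter. The following is a minimal sketch using the string from Example 1; the comparison with an explicit " " separator is added here for illustration only.

```python
s = " word1 word2 "

# No separator: runs of whitespace delimit words, and whitespace at the
# edges produces no empty strings.
no_sep = s.split()

# Explicit single-space separator: every space is a delimiter, so the
# leading and trailing spaces yield empty strings at both ends.
space_sep = s.split(" ")

print(no_sep)     # ['word1', 'word2']
print(space_sep)  # ['', 'word1', 'word2', '']
```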
WHY IS THIS BAD?
1. Because it's undocumented. See:
http://www.python.org/doc/current/lib/string-methods.html#l2h-197
2. Because it may lead to unexpected behavior in programs.
Example 2:
FASTA sequence headers are one line descriptors of
biological sequences and have this form:
">" + Identifier + whitespace + free text description.
Let sHeader be a Python string containing a FASTA
header. One could then use the following syntax to
extract the identifier from the header:
sID = sHeader[1:].split(None, 1)[0]
However, this does not work if sHeader contains a
faulty FASTA header where the identifier is missing or
consists of whitespace. In that case sID will contain
the first word of the free text description, which is
not the desired behavior.
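A sketch of the pitfall and one possible workaround follows. The helper name fasta_identifier and the sample headers are hypothetical, made up for illustration; the check simply looks at the character right after ">" before splitting.

```python
def fasta_identifier(header):
    """Return the identifier of a FASTA header line, or '' if it is missing.

    Hypothetical helper for illustration, not part of any library.
    """
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: %r" % header)
    body = header[1:]
    # If nothing, or whitespace, directly follows '>', there is no identifier.
    if not body or body[0].isspace():
        return ""
    return body.split(None, 1)[0]


good = ">seq1 Homo sapiens hemoglobin"
bad = "> Homo sapiens hemoglobin"    # identifier missing

print(good[1:].split(None, 1)[0])    # seq1
print(bad[1:].split(None, 1)[0])     # Homo  (first word of the description)
print(repr(fasta_identifier(bad)))   # ''
```

The explicit check keeps the common split(None, 1) idiom for well-formed headers while reporting a missing identifier as an empty string.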
WHAT SHOULD BE DONE?
The implicit strip() should be removed, or at least
programmers should be given the option to turn it off.
At the very least it should be documented so that
programmers have a chance of adapting their code to it.
Thank you for an otherwise splendid language!
/Joel Hedlund
Ph.D. Student
IFM Bioinformatics
Linköping University
----------------------------------------------------------------------
Comment By: Terry J. Reedy (tjreedy)
Date: 2005-01-24 02:15
Message:
Logged In: YES
user_id=593130
To me, the removal of whitespace at the ends (stripping) is
consistent with the removal (or collapsing) of extra
whitespace in between so that .split() does not return empty
words anywhere. Consider:
>>> ',1,,2,'.split(',')
['', '1', '', '2', '']
If ' 1  2 '.split() were to return null strings at the beginning
and end of the list, then to be consistent, it should also put
one in the middle. One can get this by being explicit (mixed
WS can be handled by translation):
>>> ' 1  2 '.split(' ')
['', '1', '', '2', '']
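The "translation" idea mentioned above might look like the sketch below. It uses the Python 3 spelling (str.maketrans), which is an assumption about the reader's interpreter and not what was available when this comment was written; the names TO_SPACE and split_keep_empty are hypothetical.

```python
import string

# Map every ASCII whitespace character to a plain space.
TO_SPACE = str.maketrans(string.whitespace, " " * len(string.whitespace))


def split_keep_empty(s):
    """Split on each single whitespace character, keeping empty words."""
    return s.translate(TO_SPACE).split(" ")


print(split_keep_empty(" 1\t\n2 "))  # ['', '1', '', '2', '']
```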
Having said this, I also agree that the extra words proposed
by jj are helpful.
BUG?? In 2.2, splitting an empty or whitespace only string
produces an empty list [], not a list with a null word [''].
>>> ''.split()
[]
>>> ' '.split()
[]
which is what I see as consistent with the rest of the no-null-
word behavior. Has this changed since? (Yes, must
upgrade.) I could find no indication of such change in either
the tracker or CVS.
----------------------------------------------------------------------
Comment By: YoHell (yohell)
Date: 2005-01-20 09:59
Message:
Logged In: YES
user_id=1008220
Brilliant, guys!
Thanks again for a superb scripting language, and with
documentation to match!
Take care!
/Joel Hedlund
----------------------------------------------------------------------
Comment By: Raymond Hettinger (rhettinger)
Date: 2005-01-20 09:50
Message:
Logged In: YES
user_id=80475
The proposed wording is fine.
If there are no objections or concerns, I'll apply it soon.
----------------------------------------------------------------------
Comment By: Jim Jewett (jimjjewett)
Date: 2005-01-20 09:28
Message:
Logged In: YES
user_id=764593
Replacing the quoted line:
"""
...
If sep is not specified or is None, a different splitting
algorithm is applied. First whitespace (spaces, tabs,
newlines, returns, and formfeeds) is stripped from both
ends. Then words are separated by arbitrary length
strings of whitespace characters. Consecutive whitespace
delimiters are treated as a single delimiter ("'1 2 3'.split()"
returns "['1', '2', '3']"). Splitting an empty (or whitespace-
only) string returns "['']".
"""
----------------------------------------------------------------------
Comment By: Raymond Hettinger (rhettinger)
Date: 2005-01-20 09:04
Message:
Logged In: YES
user_id=80475
What new wording do you propose to be added?
----------------------------------------------------------------------
Comment By: YoHell (yohell)
Date: 2005-01-20 05:15
Message:
Logged In: YES
user_id=1008220
In RE to tim_one:
> I think the docs for split() under "String Methods" are quite
> clear:
On the contrary, my friend, and here's why:
> """
> ...
> If sep is not specified or is None, a different splitting
> algorithm is applied.
This sentence does not say that whitespace will be
implicitly stripped from the edges of the string.
> Words are separated by arbitrary length strings of whitespace
> characters (spaces, tabs, newlines, returns, and formfeeds).
Neither does this one.
> Consecutive whitespace delimiters are treated as a single
> delimiter ("'1 2 3'.split()" returns "['1', '2', '3']").
And not that one.
> Splitting an empty string returns "['']".
> """
And that last one does not mention it either. In fact, there
is no mention in the docs of how separators on edges of
strings are treated by the split method. And furthermore,
there is no mention that s.split(sep) treats them
differently when sep is None than it does otherwise. Example:
>>> ",2,".split(',')
['', '2', '']
>>> " 2 ".split()
['2']
This inconsistent behavior is not in line with how
beautifully thought out the Python language is otherwise,
and how brilliantly everything else is documented on the
http://python.org/doc/ documentation pages.
> This won't change, because mountains of code rely on this
> behavior -- it's probably the single most common use case
> for .split().
I thought as much. However, it would be really easy for
an admin to add a line of documentation to .split() to
explain this. That would certainly help make me a happier
man, and hopefully others too.
Cheers guys!
/Joel
----------------------------------------------------------------------
Comment By: Tim Peters (tim_one)
Date: 2005-01-19 11:56
Message:
Logged In: YES
user_id=31435
I think the docs for split() under "String Methods" are quite
clear:
"""
...
If sep is not specified or is None, a different splitting
algorithm is applied. Words are separated by arbitrary length
strings of whitespace characters (spaces, tabs, newlines,
returns, and formfeeds). Consecutive whitespace delimiters
are treated as a single delimiter ("'1 2 3'.split()"
returns "['1', '2', '3']"). Splitting an empty string returns "['']".
"""
This won't change, because mountains of code rely on this
behavior -- it's probably the single most common use case
for .split().
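For readers who want to see the whitespace algorithm spelled out, the behavior can be approximated as "strip both ends, then split on runs of whitespace". The sketch below is only an approximation under that assumption, not the actual implementation; the function name split_on_whitespace is hypothetical, and the comparison is only exercised on ordinary ASCII whitespace.

```python
import re


def split_on_whitespace(s):
    """Approximate s.split() with an explicit strip followed by a split."""
    stripped = s.strip()
    if not stripped:
        # Empty or whitespace-only input: no words at all.
        return []
    # One or more consecutive whitespace characters act as one delimiter.
    return re.split(r"\s+", stripped)


for sample in ["", "   ", " 1 2 ", "1  2   3", "\ta\nb\r\n"]:
    print(repr(sample), split_on_whitespace(sample) == sample.split())
```

Note that the sketch returns an empty list for empty or whitespace-only input, which is what s.split() itself returns.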
----------------------------------------------------------------------