[ python-Bugs-1390608 ] split() breaks no-break spaces
SourceForge.net
noreply at sourceforge.net
Fri Dec 30 14:06:23 CET 2005
Bugs item #1390608, was opened at 2005-12-26 16:03
Message generated for change (Comment added) made by lemburg
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1390608&group_id=5470
Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
>Status: Closed
>Resolution: Wont Fix
Priority: 5
Submitted By: MvR (maxim_razin)
>Assigned to: M.-A. Lemburg (lemburg)
Summary: split() breaks no-break spaces
Initial Comment:
string.split(), str.split() and unicode.split() without
parameters break strings by the No-break space (U+00A0)
character. This character is specially intended not to
be a split border.
>>> u"Hello\u00A0world".split()
[u'Hello', u'world']
----------------------------------------------------------------------
>Comment By: M.-A. Lemburg (lemburg)
Date: 2005-12-30 14:06
Message:
Logged In: YES
user_id=38388
Maxim, you are right that \xA0 is a non-break space.
However, like the others already mentioned, the .split()
method defaults to breaking a string on whitespace
characters, not breakable whitespace characters. The intent
is not a typographical one, but originates from the desire
to quickly tokenize a string.
If you'd rather see a different set of whitespace
characters used, you can pass such a template string to the
.split() method (Walter gave an example).
Closing this as "Won't fix".
----------------------------------------------------------------------
Comment By: Walter Dörwald (doerwalter)
Date: 2005-12-30 13:35
Message:
Logged In: YES
user_id=89016
What's wrong with the following?
import sys, unicodedata
spaces = u"".join(unichr(c) for c in xrange(0, sys.maxunicode)
                  if unicodedata.category(unichr(c)) == "Zs" and c != 160)
foo.split(spaces)
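
One caveat with the snippet above: when split() is given an argument, it treats
the whole string as a single separator, not as a set of characters, so
foo.split(spaces) would only split where the entire run of characters occurs.
Category "Zs" also leaves out tab and newline. A minimal sketch of the same
idea using a regular expression, written for Python 3 (an assumption, since the
thread is about Python 2.4); the helper name split_keep_nbsp is hypothetical:

```python
import re

# Matches runs of characters that are whitespace (\s) but are not U+00A0.
# "[^\S\u00A0]" reads: not a non-whitespace character and not NO-BREAK SPACE.
_BREAKING_WS = re.compile(r"[^\S\u00A0]+")


def split_keep_nbsp(text):
    """Split on whitespace like str.split(), but never on U+00A0.

    Hypothetical helper for illustration; other non-breaking characters
    (for example U+202F) are still treated as separators here.
    """
    # re.split() yields empty strings for leading/trailing separators;
    # drop them to mimic the behavior of split() without arguments.
    return [token for token in _BREAKING_WS.split(text) if token]


# A multi-character argument to str.split() is one separator, not a set:
print("a b".split(" \t"))
print(split_keep_nbsp("Hello\u00A0world  foo\tbar"))
```

The regular expression keeps the notion of whitespace used by the re module
and subtracts only the one character, which avoids building the character set
by hand.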
----------------------------------------------------------------------
Comment By: Hye-Shik Chang (perky)
Date: 2005-12-30 01:30
Message:
Logged In: YES
user_id=55188
The Python documentation says that it splits on "whitespace
characters", not "breaking characters". So the current
behavior is correct according to the documentation.
Moreover, the behavior of the string methods depends heavily
on the ctype functions in libc. Therefore, we can't give the
NBSP special treatment.
However, I do see the need for a splitting function that is
aware of which characters are breaking and which are not.
How about adding it as unicodedata.split()?
----------------------------------------------------------------------
Comment By: Fredrik Lundh (effbot)
Date: 2005-12-29 21:42
Message:
Logged In: YES
user_id=38376
split isn't a word-wrapping split, so I'm not sure that's
the right place to fix this. ("no-break space" is whitespace,
according to the Unicode standard, and split breaks on
whitespace.)
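
The classification mentioned above can be checked from Python itself. A small
sketch, assuming Python 3 (where str is the Unicode string type) and the
standard unicodedata module:

```python
import unicodedata

nbsp = "\u00A0"

name = unicodedata.name(nbsp)          # official Unicode character name
category = unicodedata.category(nbsp)  # general category; "Zs" = space separator
is_space = nbsp.isspace()              # what split() without arguments relies on
tokens = "Hello\u00A0world".split()    # the case from the initial report

print(name, category, is_space, tokens)
# prints: NO-BREAK SPACE Zs True ['Hello', 'world']
```

The "no-break" property concerns line breaking, which is a separate question
from whether the character counts as whitespace.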
----------------------------------------------------------------------