Python NBSP DWIM
Steven D'Aprano
steve at pearwood.info
Wed Jun 10 13:11:47 EDT 2015
On Thu, 11 Jun 2015 12:28 am, Skip Montanaro wrote:
> On Wed, Jun 10, 2015 at 8:28 AM, Tim Chase
> <python.list at tim.thechases.com> wrote:
>> Is this a bug?
>
> Looks like it's been reported a few times with slightly different context:
>
> https://bugs.python.org/issue6537
> https://bugs.python.org/issue16623
> https://bugs.python.org/issue20491
> https://bugs.python.org/issue1390608
>
> The couple times it's come up in the context of str.split, it's been
> rejected, since the purpose of that method is to split words.
That reasoning is ... strange. The whole point of the NBSP is specifically
*not* to split on it. If you wanted it to split, you would use a regular
space.
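For anyone who wants to see the behaviour being discussed, here is a minimal check. It assumes Python 3, where every str is Unicode; the sample string and the variable names are made up for illustration.

```python
# NBSP is U+00A0. With no argument, str.split() splits on every character
# for which str.isspace() is true, and that includes NBSP.
NBSP = '\u00a0'
text = 'spam' + NBSP + 'eggs'

nbsp_is_space = NBSP.isspace()
default_split = text.split()       # breaks the string at the NBSP
explicit_split = text.split(' ')   # splits only on U+0020, so the NBSP survives

print(nbsp_is_space, default_split, explicit_split)
```

Splitting on an explicit ' ' keeps the NBSP, but it also stops collapsing runs of spaces and ignores tabs and newlines, so it is not a drop-in replacement for the no-argument form.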
(Oh, and for the record, there are at least two non-breaking spaces in
Unicode, U+00A0 "NO-BREAK SPACE" and U+202F "NARROW NO-BREAK SPACE".)
http://www.unicode.org/charts/PDF/U0080.pdf
http://www.unicode.org/charts/PDF/U2000.pdf
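The two code points can be inspected from the standard library. This is only a quick look, assuming Python 3; the `info` dict is a name chosen for this example.

```python
import unicodedata

# Collect the Unicode name, general category and isspace() result
# for the two non-breaking spaces mentioned above.
info = {}
for ch in ('\u00a0', '\u202f'):
    info[ch] = (unicodedata.name(ch), unicodedata.category(ch), ch.isspace())
    print('U+%04X' % ord(ch), *info[ch])
```

Both are in category Zs (space separator), which is why whitespace-based splitting treats them like an ordinary space.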
Non-breaking spaces should be used when you want to prevent
word-wrapping, and also for "open form" compound words:
http://grammar.ccc.commnet.edu/grammar/compounds.htm
textwrap should also treat NBSPs as non-spaces for the purposes of wrapping.
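Where textwrap does break at an NBSP, one possible stop-gap is to hide the NBSPs behind a placeholder before wrapping and restore them afterwards. This is a sketch only, assuming Python 3: `wrap_keeping_nbsp` is a hypothetical helper, not part of textwrap, and it assumes the private-use character U+E000 never occurs in the input.

```python
import textwrap

NBSP = '\u00a0'
PLACEHOLDER = '\ue000'  # private-use character, assumed absent from the text


def wrap_keeping_nbsp(text, width=70):
    """Wrap text with textwrap.wrap() without breaking at NBSPs."""
    if PLACEHOLDER in text:
        raise ValueError('text already contains the placeholder character')
    # The placeholder is not whitespace, so textwrap sees "a<NBSP>b" as one word.
    hidden = text.replace(NBSP, PLACEHOLDER)
    wrapped = textwrap.wrap(hidden, width=width)
    # Swap the real NBSPs back in; both characters have length 1,
    # so the line widths are unchanged.
    return [line.replace(PLACEHOLDER, NBSP) for line in wrapped]


lines = wrap_keeping_nbsp('aaa bbb' + NBSP + 'ccc ddd', width=8)
print(lines)
```

The same trick would need a second placeholder to cover U+202F as well.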
As a work-around, I think this should work:
- split the string on NBSPs;
- for each substring returned, split normally;
- merge the sub-substrings on either side of each NBSP back together.
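def split(s):
    """Split on whitespace, except NBSP.

    >>> split(u'hello world spam\\u00A0eggs cheese')
    [u'hello', u'world', u'spam\\xa0eggs', u'cheese']
    """
    words = []
    NBSP = u'\u00A0'
    joined = False  # True while the last word can still be extended
    for i, sub in enumerate(s.split(NBSP)):
        if i > 0:
            # Put back the NBSP that s.split(NBSP) removed.
            if joined:
                words[-1] += NBSP
            else:
                words.append(NBSP)
                joined = True
        parts = sub.split()
        if parts and joined and not sub[0].isspace():
            # Nothing separates the NBSP from the next word: glue them.
            words[-1] += parts.pop(0)
        words.extend(parts)
        if sub:
            # Trailing whitespace in this piece closes the last word.
            joined = not sub[-1].isspace()
    return words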
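Another way to get the same kind of result is a regular expression that treats the non-breaking spaces as word characters. This is a rough alternative, assuming Python 3 (where \S on a str pattern is Unicode-aware); `split_keep_nbsp` and `_WORD` are names invented here, and unlike the function above it also keeps U+202F.

```python
import re

# A "word" is a run of non-whitespace characters, with NBSP (U+00A0) and
# NARROW NBSP (U+202F) counted as part of the word rather than as separators.
_WORD = re.compile('(?:\\S|[\u00a0\u202f])+')


def split_keep_nbsp(s):
    """Split on whitespace, except the two non-breaking spaces."""
    return _WORD.findall(s)


print(split_keep_nbsp('hello world spam\u00a0eggs cheese'))
```

Leading, trailing or doubled NBSPs simply stay attached to whatever word they touch, for example split_keep_nbsp(' \u00a0b') gives a single item with the NBSP in front of the b.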
--
Steven