Report on non-breaking spaces in posts

Tue Oct 31 17:37:27 EDT 2017

Rhodri James <rhodri at kynesim.co.uk> writes:

> On 31/10/17 17:23, Stefan Ram wrote:
>> Ned Batchelder <ned at nedbatchelder.com> writes:
>>> Â Â Â  def wrapped_join(values, sep):
>>
>>    Ok, here's a report on me seeing non-breaking spaces in
>>    posts in this NG. I have written this report so that you
>>    can see that it's not my newsreader that is converting
>>    something, because there is no newsreader involved.
>>
>>    Here are some relevant lines from Ned's above post:
>>
>> |From: Ned Batchelder <ned at nedbatchelder.com>
>> |Newsgroups: comp.lang.python
>> |Subject: Re: How to join elements at the beginning and end of the list
>> |Message-ID: <mailman.95.1509464977.1490.python-list at python.org>
>
> Hm.  That suggests the mail-to-news gateway has a hand in things.
>
>> |Content-Type: text/plain; charset=utf-8; format=flowed
>> |Content-Transfer-Encoding: 8bit
>> | Â Â Â  def wrapped_join(values, sep):
>
> [snippety snip]
>
>> |od -c tmp.txt
>> |...
>> |0012620   s   u   l   a   t   e       i   t   :  \n  \n       Â       Â
>> |0012640       Â           d   e   f       w   r   a   p   p   e   d   _
>> |...
>> |
>> |od -x tmp.txt
>> |...
>> |0012620 7573 616c 6574 6920 3a74 0a0a c220 c2a0
>> |0012640 c2a0 20a0 6564 2066 7277 7061 6570 5f64
>> |...
>>
>>    And you can see, there are two octet pairs »c220« and
>>    »c2a0« in the post (directly preceding »def wrapped«).
>>    (Compare with the Content-Type and Content-Transfer-Encoding
>>    given above.) (Read table with a monospaced font:)
>>
>>                          corresponding
>> Codepoint      UTF-8    ISO-8859-1      interpretation
>>
>> U+0020?        c2 20    20?             SPACE?
>> U+00A0         c2 a0    a0              NON-BREAKING SPACE
>>
>>    This makes it clear that there really are codepoints
>>    U+00A0 in what I get from the server, i.e., non-breaking
>>    spaces directly in front of »def wrapped«.
>
> And?  Why does that bother you?  A non-breaking space is a perfectly
> valid thing to put into a UTF-8 encoded message.

But it's an odd thing to put into Python code (at least there).  If the
Usenet client is doing it, that's surely bad, as the code won't run
without editing.
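
To illustrate the "won't run without editing" point, here is a minimal
sketch.  The function body is a made-up placeholder (only the def line
appears in the quoted post), and the names NBSP, pasted and cleaned are
mine.  It assumes Python 3, where U+00A0 in indentation is rejected by
the compiler.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE, encoded as c2 a0 in UTF-8

# Hypothetical pasted code: the body is indented with NBSPs, not spaces.
# The body itself is a placeholder, not the code from the original post.
pasted = (
    "def wrapped_join(values, sep):\n"
    + NBSP * 4 + "return sep.join(values)\n"
)

try:
    compile(pasted, "<post>", "exec")
    compiled_as_is = True
except SyntaxError:
    # Python 3 does not treat U+00A0 as indentation whitespace.
    compiled_as_is = False

# The "editing" needed: swap each NBSP for an ordinary space.
cleaned = pasted.replace(NBSP, " ")
namespace = {}
exec(compile(cleaned, "<post>", "exec"), namespace)
joined = namespace["wrapped_join"](["a", "b"], ", ")
```

So the reader has to replace every U+00A0 with U+0020 before the code
compiles.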

> The 0xc2 0x20 byte
> pair that you misidentify as a space is another matter entirely.
>
> 0xc2 0x20 is not a space in UTF-8.  It is an invalid code sequence.  I
> don't know how or where it was generated, but it really shouldn't have
> been.

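It wasn't there.  It was down to a misreading of the byte order in the
hex dump: od -x prints 16-bit words, so on a little-endian machine
»c220« is the bytes 20 c2, a plain space followed by the first byte of
a non-breaking space.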
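
A small sketch of that reading, assuming the dump was made on a
little-endian machine (the post does not say, so that is my
assumption).  The words are copied from the od -x output quoted above;
the variable names are mine.

```python
# The 16-bit words from the two od -x lines quoted above.
words = (
    "7573 616c 6574 6920 3a74 0a0a c220 c2a0 "
    "c2a0 20a0 6564 2066 7277 7061 6570 5f64"
).split()

# Assumed little-endian host: the low byte of each word comes first
# in the file.
data = b"".join(int(w, 16).to_bytes(2, "little") for w in words)
text = data.decode("utf-8")  # decodes cleanly

# Reading each word left to right (big-endian) yields c2 20, which is
# not valid UTF-8.
misread = b"".join(int(w, 16).to_bytes(2, "big") for w in words)
try:
    misread.decode("utf-8")
    misread_is_valid = True
except UnicodeDecodeError:
    misread_is_valid = False

print(data)
```

Read that way the bytes before "def" are 20 c2 a0 c2 a0 c2 a0 20: a
space, three non-breaking spaces and another space, with no c2 20
sequence anywhere.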

<snip>
-- 
Ben.