[I18n-sig] Support for "wide" Unicode characters

Machin, John JMachin@Colonial.com.au
Thu, 28 Jun 2001 23:41:08 +1000


[Guido van Rossum]
> store Unicode strings using UTF-8.
> 
> Does UTF-8 transfer isolated surrogates correctly?  

[Marc-Andre Lemburg]
It handles surrogates correctly, but rejects isolated ones on input
(easy to fix though) and passes them through on output. As I said
before, surrogate support is far from being complete.

Marc-Andre, there is a *bug* in how 2.1 encodes isolated high surrogates.
I reported it and you assigned it to yourself on 23 June. Lookee here:

Python 2.1 (#15, Apr 16 2001, 18:25:49) [MSC 32 bit (Intel)] on win32
Type "copyright", "credits" or "license" for more information.
>>> u'\ud800'.encode('utf-8')
'\xa0\x80' # should be 3 bytes, not 2
>>>

While the fix is trivial, IMO an appropriate answer to Guido's question
would include this particular lack of correctness.
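
For reference, here is a minimal sketch of what the 3-byte form should look like. It is not the codec's actual code: the helper name utf8_encode_3byte is made up for illustration, and it is written for a current Python 3 (bytes objects) rather than 2.1. It just applies the standard 3-byte UTF-8 bit pattern (1110xxxx 10xxxxxx 10xxxxxx) to a code point in U+0800..U+FFFF, which is the range an isolated surrogate falls in.

```python
def utf8_encode_3byte(code_point):
    """Encode one code point in U+0800..U+FFFF with the 3-byte UTF-8 pattern.

    Hypothetical helper for illustration only; it deliberately does not
    reject surrogates, since the point is to show the bytes an isolated
    surrogate would produce if passed through.
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("code point outside the 3-byte UTF-8 range")
    return bytes([
        0xE0 | (code_point >> 12),          # lead byte: 1110xxxx
        0x80 | ((code_point >> 6) & 0x3F),  # continuation: 10xxxxxx
        0x80 | (code_point & 0x3F),         # continuation: 10xxxxxx
    ])


encoded = utf8_encode_3byte(0xD800)
print(len(encoded), encoded)
```

Under these assumptions the trailing two bytes agree with the 2.1 output above, so the buggy result appears to be the correct sequence with its lead byte missing.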

Cheers,
John


