[I18n-sig] Support for "wide" Unicode characters

Guido van Rossum guido@digicool.com
Thu, 28 Jun 2001 10:14:31 -0400


> Guido asked:
>    Does UTF-8 transfer isolated surrogates correctly? 
> 
> No. See my bug report in SF. Briefly, a lone high
> surrogate has its leading UTF-8 byte omitted,
> causing an illegal UTF-8 sequence to be generated.
> 
> Here's the URL:
> http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=433882
> 
> (or search for "surrogates")

It's a bug indeed.

But my question was about the definition of UTF-8, not our (fallible)
implementation.

What *should* be the result of u'\ud800'.encode('utf8')?
'\xed\xa0\x80' or an exception?

And likewise, what should be the result of
unicode('\xed\xa0\x80', 'utf8')?
u'\ud800' or an exception?

(Likewise for low surrogates; currently, u'\udc00'.encode('utf8')
returns '\xed\xb0\x80', but unicode('\xed\xb0\x80', 'utf8') raises an
exception.)
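
For concreteness, the byte strings quoted above are what the generic
three-byte UTF-8 bit pattern yields if a surrogate is treated like any
other code point in the range U+0800..U+FFFF. The following is only a
sketch of that arithmetic; naive_utf8_3byte is a hypothetical helper
written for illustration, not part of Python's codec machinery, and it
says nothing about whether the UTF-8 definition permits such output.

```python
def naive_utf8_3byte(cp):
    """Apply the 1110xxxx 10xxxxxx 10xxxxxx pattern to a code point.

    Hypothetical helper: it deliberately does NOT reject surrogates
    (U+D800..U+DFFF), so it shows the bytes a permissive encoder
    would emit for them.
    """
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("code point outside the three-byte range")
    return [
        0xE0 | (cp >> 12),          # top 4 bits
        0x80 | ((cp >> 6) & 0x3F),  # middle 6 bits
        0x80 | (cp & 0x3F),         # low 6 bits
    ]


for cp in (0xD800, 0xDC00):
    print("U+%04X -> %s" % (cp, " ".join("%02x" % b for b in naive_utf8_3byte(cp))))
# U+D800 -> ed a0 80
# U+DC00 -> ed b0 80
```

Whether an encoder should emit these bytes, or a decoder accept them,
is exactly the open question; the sketch only shows where the values
come from.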

--Guido van Rossum (home page: http://www.python.org/~guido/)