[I18n-sig] Support for "wide" Unicode characters
Guido van Rossum
guido@digicool.com
Thu, 28 Jun 2001 10:14:31 -0400
> Guido asked:
> Does UTF-8 transfer isolated surrogates correctly?
>
> No. See my bug report in SF. Briefly, a lone high
> surrogate has its leading UTF-8 byte omitted,
> causing an illegal UTF-8 sequence to be generated.
>
> Here's the URL:
> http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=433882
>
> (or search for "surrogates")
It's a bug indeed.
But my question was about the definition of UTF-8, not our (fallible)
implementation.
What *should* be the result of u'\ud800'.encode('utf8')?
'\xed\xa0\x80' or an exception?
And likewise, what should be the result of
unicode('\xed\xa0\x80', 'utf8')?
u'\ud800' or an exception?
(Likewise for low surrogates; currently, u'\udc00'.encode('utf8')
returns '\xed\xb0\x80', but unicode('\xed\xb0\x80', 'utf8') raises an
exception.)
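
To make the byte values above concrete, here is a minimal sketch of where
'\xed\xa0\x80' and '\xed\xb0\x80' come from: they are what the generic
3-byte UTF-8 pattern yields if it is applied mechanically to a surrogate
code point. The helper names three_byte_utf8 and strict_encode_fails are
hypothetical, made up for illustration. The sketch assumes a Python 3
interpreter (not the 2.x syntax used in this message), where the strict
codec rejects lone surrogates and the 'surrogatepass' error handler opts
in to the mechanical encoding. It does not settle what the definition of
UTF-8 requires.

```python
def three_byte_utf8(cp):
    """Apply the 3-byte pattern 1110xxxx 10xxxxxx 10xxxxxx to a code point.

    Hypothetical helper: it does NOT check for surrogates, so it shows
    what a naive encoder would emit for U+D800..U+DFFF.
    """
    if not 0x800 <= cp <= 0xFFFF:
        raise ValueError("code point outside the 3-byte range")
    return bytes([
        0xE0 | (cp >> 12),           # top 4 bits
        0x80 | ((cp >> 6) & 0x3F),   # middle 6 bits
        0x80 | (cp & 0x3F),          # low 6 bits
    ])


def strict_encode_fails(ch):
    """Return True if the default (strict) UTF-8 codec rejects ch."""
    try:
        ch.encode('utf-8')
    except UnicodeEncodeError:
        return True
    return False


high = three_byte_utf8(0xD800)   # lone high surrogate
low = three_byte_utf8(0xDC00)    # lone low surrogate
print(high, low)

# Assumption: Python 3 behaviour. Strict encoding refuses the surrogate,
# while 'surrogatepass' round-trips it through the same three bytes.
print(strict_encode_fails('\ud800'))
print('\ud800'.encode('utf-8', 'surrogatepass') == high)
print(low.decode('utf-8', 'surrogatepass') == '\udc00')
```

The two helpers are kept separate on purpose: the first shows the
arithmetic only, the second shows one implementation's policy, which is
the distinction this question is about.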
--Guido van Rossum (home page: http://www.python.org/~guido/)