[I18n-sig] Support for "wide" Unicode characters

Machin, John JMachin@Colonial.com.au
Fri, 29 Jun 2001 08:08:04 +1000

[John Machin]
> Guido asked:
>    Does UTF-8 transfer isolated surrogates correctly? 
> No. See my bug report in SF. Briefly, a lone high
> surrogate has its leading UTF-8 byte omitted,
> causing an illegal UTF-8 sequence to be generated.
> Here's the URL:
> 3882
> (or search for "surrogates")

[Guido again]
It's a bug indeed.

But my question was about the definition of UTF8, not our (fallible)
implementation.

What *should* be the result of u'\ud800'.encode('utf8')?
'\xed\xa0\x80' or an exception?

And likewise, what should be the result of unicode('\xed\xa0\x80',
'utf8')? u'\ud800' or an exception?

(Likewise for low surrogates; currently, u'\udc00'.encode('utf8')
returns '\xed\xb0\x80', but unicode('\xed\xb0\x80', 'utf8') raises an
exception.)
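
For comparison, here is a small sketch of how a present-day CPython 3
interpreter treats the same two cases. It only observes a later
implementation; it says nothing about the 2001 codec or about what the
standard requires. The 'surrogatepass' error handler is a standard
Python 3 feature; the variable names are just for illustration.

```python
# Python 3 sketch: how a modern interpreter treats a lone high surrogate.
lone_high = "\ud800"

# Encoding with the default errors='strict' is refused.
try:
    lone_high.encode("utf-8")
    strict_encode = "encoded"
except UnicodeEncodeError:
    strict_encode = "exception"

# The 'surrogatepass' handler lets the surrogate through as 3 bytes.
passed = lone_high.encode("utf-8", "surrogatepass")
round_trip = passed.decode("utf-8", "surrogatepass")

# Decoding those 3 bytes with the default errors='strict' is refused too.
try:
    b"\xed\xa0\x80".decode("utf-8")
    strict_decode = "decoded"
except UnicodeDecodeError:
    strict_decode = "exception"

print(strict_encode, passed, strict_decode)
```

So in that interpreter the answer is "an exception" in both directions
unless the caller opts in explicitly.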

[John Machin]
OK, sorry for the misunderstanding.
A UTF-8 codec can be made to transcode scalars up to at least 31 bits wide.
The ISO 10646 specification allows for this. 
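
To make the 31-bit claim concrete, here is a minimal sketch of the
original one-to-six-byte UTF-8 layout. It is written in modern Python 3,
not the Python of this thread, and the function name encode_utf8_31bit
is hypothetical.

```python
def encode_utf8_31bit(scalar: int) -> bytes:
    """Encode one scalar value using the original 1-to-6-byte UTF-8 layout."""
    if not 0 <= scalar <= 0x7FFFFFFF:
        raise ValueError("scalar outside the 31-bit range")
    if scalar < 0x80:
        return bytes([scalar])          # plain ASCII, one byte

    # (sequence length, exclusive upper bound, lead-byte marker)
    for nbytes, limit, lead in (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    ):
        if scalar < limit:
            break

    out = []
    # Emit continuation bytes from the low-order end, 6 bits at a time.
    for _ in range(nbytes - 1):
        out.append(0x80 | (scalar & 0x3F))
        scalar >>= 6
    out.append(lead | scalar)           # remaining high bits go in the lead byte
    return bytes(reversed(out))


print(encode_utf8_31bit(0xD800).hex())
print(encode_utf8_31bit(0x7FFFFFFF).hex())
```

Note that this sketch applies no range or surrogate checks at all; it
only shows that the byte layout itself has room for 31 bits.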

So, for marshalling and (pickling?) purposes, calling the UTF-8 codec with
errors='liberal' would be the way to go. IMO, 'liberal' should still give an
exception for over-long UTF-8 byte sequences -- an encoder which produces
such is broken (either accidentally or deliberately) -- but should happily
transcode any scalar value <= X for some X in (0x10FFFF, 0x7FFFFFFF).
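
A rough sketch of what such a 'liberal' decoder might look like, in
modern Python 3. The function name decode_one_liberal, the
MIN_FOR_LENGTH table and the max_scalar parameter (standing in for X,
which is left open above) are all made up for illustration; 'liberal'
is a proposed mode, not an existing error handler.

```python
# Smallest scalar that legitimately needs each sequence length (1..6 bytes).
MIN_FOR_LENGTH = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}


def decode_one_liberal(seq: bytes, max_scalar: int = 0x7FFFFFFF) -> int:
    """Decode exactly one UTF-8 sequence, rejecting over-long forms."""
    if not seq:
        raise ValueError("empty sequence")
    lead = seq[0]
    if lead < 0x80:
        nbytes, value = 1, lead
    elif 0xC0 <= lead < 0xE0:
        nbytes, value = 2, lead & 0x1F
    elif 0xE0 <= lead < 0xF0:
        nbytes, value = 3, lead & 0x0F
    elif 0xF0 <= lead < 0xF8:
        nbytes, value = 4, lead & 0x07
    elif 0xF8 <= lead < 0xFC:
        nbytes, value = 5, lead & 0x03
    elif 0xFC <= lead < 0xFE:
        nbytes, value = 6, lead & 0x01
    else:
        raise ValueError("invalid lead byte")

    if len(seq) != nbytes:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (b & 0x3F)

    # Over-long forms are rejected even in liberal mode.
    if value < MIN_FOR_LENGTH[nbytes]:
        raise ValueError("over-long sequence")
    if value > max_scalar:
        raise ValueError("scalar above the configured limit")
    return value  # surrogates and values above 0x10FFFF pass through


print(hex(decode_one_liberal(b"\xed\xa0\x80")))
```

Here b'\xc0\x80' (an over-long encoding of zero) is refused, while a
lone surrogate such as b'\xed\xa0\x80' is accepted.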

IMO, when errors is 'strict', the upper limit should be 0xFFFF for narrow
builds and 0x10FFFF for wide builds.
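
The limits proposed above could be expressed roughly as follows. This
is only a sketch in modern Python 3: the names strict_limit and
check_scalar, the wide_build flag and the liberal_limit default are
assumptions for illustration, and 'liberal' is the proposed mode, not
an existing one.

```python
def strict_limit(wide_build: bool) -> int:
    """Upper limit for errors='strict': depends on the build width."""
    return 0x10FFFF if wide_build else 0xFFFF


def check_scalar(scalar: int, errors: str = "strict",
                 wide_build: bool = False,
                 liberal_limit: int = 0x7FFFFFFF) -> int:
    """Return scalar unchanged if it is within the limit for this mode."""
    if errors == "strict":
        limit = strict_limit(wide_build)
    elif errors == "liberal":
        limit = liberal_limit        # the X of the proposal; value assumed here
    else:
        raise ValueError("unknown errors mode: %r" % (errors,))
    if not 0 <= scalar <= limit:
        raise ValueError("scalar 0x%X exceeds limit 0x%X" % (scalar, limit))
    return scalar


print(hex(check_scalar(0x10000, wide_build=True)))
```

With these defaults, 0x10000 passes on a wide build but is refused on a
narrow one, and only the 'liberal' mode accepts anything above 0x10FFFF.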

IMO, unicode(), u.encode() and the \U notation should all use 'strict' ...
perhaps the exception messages produced by the narrow build could be 
marketing-aligned and point the punter to the wide build.


