[I18n-sig] Unicode surrogates: just say no!

Tom Emerson tree@basistech.com
Tue, 26 Jun 2001 08:40:27 -0400

Guido van Rossum writes:
> 3. The ideal situation.  This uses UCS-4 for storage and doesn't
>    require any support for surrogates except in the UTF-16 codecs (and
>    maybe in the UTF-8 codecs; it seems that encoded surrogate pairs
>    are legal in UTF-8 streams but should be converted back to a single
>    character).  It's unclear to me whether the (illegal, according to
>    the Unicode standard) "characters" whose numerical value looks like
>    a lone surrogate should be entirely ruled out here, or whether a
>    dedicated programmer could create strings containing these.  We
>    could make it hard by declaring unichr(i) with surrogate i and \u
>    and \U escapes that encode surrogates illegal, and by adding
>    explicit checks to codecs as appropriate, but a C extension could
>    still create an array containing illegal characters unless we do
>    draconian input checking.

UTF-8 can be used to encode each half of a surrogate pair
(resulting in six bytes for the character) --- a proposal for this was
presented by PeopleSoft at the UTC meeting last month. UTF-8 can also
encode the code point directly in four bytes.
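To make the two forms concrete, here is a sketch (in today's Python, purely for illustration) of U+10000 carried through UTF-8 both ways:

```python
# Sketch (modern Python, illustrative): U+10000 carried through UTF-8
# both ways described above.

# Direct encoding of the code point: four bytes.
direct = "\U00010000".encode("utf-8")
print(direct.hex())  # f0908080

def utf8_3byte(cp):
    # Three-byte UTF-8 sequence for a 16-bit value -- used here to
    # encode each surrogate half as if it were an ordinary character.
    return bytes([0xE0 | (cp >> 12),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Surrogate-pair form: six bytes for the same character.
pair = utf8_3byte(0xD800) + utf8_3byte(0xDC00)
print(pair.hex())  # eda080edb080
```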

As Marc-Andre said in his response, you can have a valid stream of Unicode
characters containing half a surrogate pair: that lone half, however, is not
a legal character according to the Unicode standard.
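A quick illustration (in today's Python, which took roughly the draconian route Guido describes) of how a lone half behaves:

```python
# Sketch (modern Python): a lone surrogate half can exist in a string
# in memory, but the UTF-8 codec refuses to emit it.
lone = "\ud800"                      # high half with no low half
try:
    lone.encode("utf-8")
except UnicodeEncodeError as err:
    print("rejected:", err.reason)   # rejected: surrogates not allowed
```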

> I see only one remaining argument against choosing 3 over 2: FUD about
> disk and primary memory space usage.

At the last IUC in Hong Kong some developers from SAP presented data
against the use of UCS-4/UTF-32 as an internal representation. In
their benchmarks they found that the overhead of cache misses due to
the increased character width was far more detrimental to runtime
than having to deal with the odd surrogate pair in a UTF-16 encoded
string. After the presentation several people (myself, Asmus Freytag,
Toby Phipps of PeopleSoft, and Paul Laenger of Software AG) had a
little chat about this issue and couldn't agree whether this was
really a big problem or not. I think it bears more research.

However, I agree that using UCS-4/UTF-32 as the internal string
representation is the best solution.

Remember too that glibc uses UCS-4 as its internal wchar_t
representation. This was also discussed at the Li18nux meetings a
couple of years ago.
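For anyone who wants to check their own platform, the wchar_t width is visible even from Python via ctypes (a quick probe, not part of any proposal here):

```python
# Sketch: probe the platform's wchar_t width from Python via ctypes.
# glibc uses 4 bytes (UCS-4); Windows uses 2 (UTF-16 code units).
import ctypes
print(ctypes.sizeof(ctypes.c_wchar))  # 4 on glibc systems, 2 on Windows
```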

> A. At some Python version, we switch.
> B. Choose between 1 and 3 based on the platform.
> C. Make it a configuration-time choice.

Defaulting to UCS-4?

> We could use B to determine the default choice, e.g. we could choose
> between option 1 and 3 depending on the platform's wchar_t; but it
> would be bad not to have a way to override this default, so we
> couldn't exploit the correspondence much.  Some code could be
> #ifdef'ed out when Py_UNICODE == wchar_t, but there would always have
> to be code to support these two having different sizes.

Seems to me this could add complexity and a reliance on platform
functionality that may not be consistent across systems. Are the savings
worth the complexity?

> The outcome of the choice must be available at run-time, because it
> may affect certain codecs.  Maybe sys.maxunicode could be the largest
> character value supported, i.e. 0xffff or 0xfffff?

or 0x10ffff?
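For what it's worth, that is exactly the kind of run-time check sys.maxunicode would allow -- a sketch:

```python
import sys
# A wide (UCS-4) build would report the full code-point range; a
# narrow (UTF-16) build would report only the BMP.
print(hex(sys.maxunicode))  # 0x10ffff on a wide build, 0xffff on a narrow one
```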

Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"