[I18n-sig] How does Python Unicode treat surrogates?

Guido van Rossum guido@digicool.com
Mon, 25 Jun 2001 14:04:13 -0400


OK, focusing on a single item.

[me]
> > If this is the only thing that keeps us from having a configuration
> > OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it.

[MAL]
> This is not easy to fix and can certainly not be made an
> option: UTF-16 has surrogates and is a variable width encoding
> of Unicode while UCS-4 is a fixed width encoding.

But even if we supported UTF-16 with surrogates, picking strings apart
using u[i] would still be able to access the separate lower and upper
halves of a surrogate pair, right?  And in the presence of surrogates,
len(u) would not match the number of *characters* in u.
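
To make that concrete, here is a rough sketch that simulates 16-bit
storage by encoding a string to UTF-16 and looking at the individual
16-bit items.  The helper names (utf16_units, count_characters) are
made up for illustration and are not part of any Python API; the
sketch assumes a Python whose str type holds full code points.

```python
import struct

def utf16_units(text):
    """Return the 16-bit items a UTF-16 representation of text would hold."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_low_surrogate(unit):
    # Low (trailing) surrogates occupy U+DC00..U+DFFF.
    return 0xDC00 <= unit <= 0xDFFF

def count_characters(units):
    """Count code points by skipping the trailing half of each pair."""
    return sum(1 for unit in units if not is_low_surrogate(unit))

# U+10400 lies outside the 16-bit range, so it needs a surrogate pair.
units = utf16_units("a\U00010400b")

item_count = len(units)               # what len(u) would report
char_count = count_characters(units)  # what a reader would call characters
second_item = units[1]                # what u[1] would hand back: half a pair
```

With 16-bit items the string above has four items but only three
characters, and indexing at position 1 yields the lone surrogate
0xD801 rather than a complete character.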

> Python currently only has minimal support for surrogates, so
> a purist would say that we support UCS-2. However, we deliberately
> chose this path to be able to upgrade to UTF-16 at some later
> point in time and it seems that this time has now come.

How hard would it be to also change the party line about which
encoding is used, depending on whether we use 2 or 4 bytes per item?
We could even give three choices: UCS-2 (current situation, no
surrogates), UTF-16 (16-bit items with some surrogate support), or
UCS-4 (32-bit items).
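
As a sketch of how the three choices differ for a single character,
the hypothetical function below (to_items is an illustrative name,
not an existing API) returns the storage items each scheme would use
for one code point.  The surrogate arithmetic is the standard UTF-16
rule; everything else is an assumption made for the example.

```python
def to_items(code_point, scheme):
    """Return the list of storage items for one code point.

    scheme is one of "UCS-2", "UTF-16" or "UCS-4" (labels chosen for
    this sketch only).
    """
    if scheme not in ("UCS-2", "UTF-16", "UCS-4"):
        raise ValueError("unknown scheme: %r" % (scheme,))
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")

    # Anything in the 16-bit range is stored the same way everywhere,
    # and UCS-4 stores every code point as a single 32-bit item.
    if scheme == "UCS-4" or code_point <= 0xFFFF:
        return [code_point]

    if scheme == "UCS-2":
        # No surrogates: characters beyond U+FFFF cannot be represented.
        raise ValueError("code point does not fit in UCS-2")

    # UTF-16: split the 20-bit offset into a high and a low surrogate.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]

utf16_items = to_items(0x10400, "UTF-16")
ucs4_items = to_items(0x10400, "UCS-4")
bmp_items = to_items(0x20AC, "UCS-2")
```

Only the UTF-16 case produces more than one item per character,
which is where u[i] and len(u) start to disagree with the character
count.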

> > I'd be happy to make the configuration choice between UTF-16 and
> > UCS-4, if that's doable.
> 
> Not easily, I'm afraid.

Can you explain why this is not easy?

> http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/
> """
> Decisions, decisions...
>   Ultimately, the choice of which encoding format to use will depend
>   heavily on the programming environment. For systems that only offer
>   8-bit strings currently, but are multi-byte enabled, UTF-8 may be
>   the best choice. For systems that do not care about storage
>   requirements, UTF-32 may be best. For systems such as Windows, Java,
>   or ICU that use UTF-16 strings already, UTF-16 is the obvious
>   choice. Even if they have not yet upgraded to fully support
>   surrogates, they will be before long.
>
>   If the programming environment is not an issue, UTF-16 is
>   recommended as a good compromise between elegance, performance, and
>   storage.
> """

I buy that as an argument for supporting UTF-16, but not for cutting
off the road to supporting UCS-4 for those users who would like to opt
in.
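
For a feel of the storage side of that trade-off, a small sketch
using Python's standard codecs to measure the byte cost of the same
text in each encoding form (storage_bytes is a name invented for this
example):

```python
def storage_bytes(text):
    """Return the encoded size of text, in bytes, for each encoding form."""
    return {
        # The "-le" codecs are used so that no byte order mark is counted.
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-32": len(text.encode("utf-32-le")),
    }

ascii_cost = storage_bytes("abc")           # three ASCII characters
astral_cost = storage_bytes("\U00010400")   # one character beyond U+FFFF
```

For plain ASCII the 32-bit form costs twice as much as the 16-bit
form, while a character outside the 16-bit range costs four bytes in
all three.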

--Guido van Rossum (home page: http://www.python.org/~guido/)