[I18n-sig] How does Python Unicode treat surrogates?

Tom Emerson tree@basistech.com
Mon, 25 Jun 2001 11:25:38 -0400

Guido van Rossum writes:
> But, just as a Python 8-bit string object containing the UTF-8 encoded
> character U+20000 contains 4 bytes, with s[0] being '\xF0' etc., a
> Python "unicode" string containing that character as a surrogate will
> have length 2, with u[0] being u'\uD840' and u[1] being u'\uDC00'.
> You can think of it as containing a single character, but the
> interface gives you the individual items of the UTF-16 encoding.
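
The arithmetic behind the quoted example can be checked with a short sketch. It assumes a modern Python 3 interpreter (where str is indexed by code point, unlike the build discussed here), and the helper names to_surrogate_pair and from_surrogate_pair are hypothetical, introduced only for illustration.

```python
# Illustrative sketch for modern Python 3; helper names are hypothetical.

def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


ch = chr(0x20000)
utf8 = ch.encode("utf-8")        # four bytes, the first being 0xF0
utf16 = ch.encode("utf-16-be")   # two 16-bit code units: D840 DC00
high, low = to_surrogate_pair(0x20000)

print(len(utf8), hex(utf8[0]))   # 4 0xf0
print(hex(high), hex(low))       # 0xd840 0xdc00
```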

So what has been implemented is UCS-2, not UTF-16, and certainly not
Unicode. Better to document u"" string literals as UCS-2, and not

> It may change *eventually* -- when we switch to UCS-4 for the internal
> representation.  Until then, the API will deal in 16-bit values that
> may or may not be "characters".
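
To illustrate what "16-bit values that may or may not be characters" means in practice, here is a minimal sketch of walking a sequence of 16-bit code units and pairing up surrogates. It is written for modern Python 3, the function name code_points is hypothetical, and it is not a description of how the interpreter itself is implemented.

```python
# Illustrative sketch; code_points is a hypothetical helper, not a Python API.

def code_points(units):
    """Yield code points from a sequence of 16-bit code units.

    A high surrogate followed by a low surrogate is combined into one
    code point; any other unit (including an unpaired surrogate) is
    passed through unchanged.
    """
    i = 0
    n = len(units)
    while i < n:
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1


units = [0x0041, 0xD840, 0xDC00, 0x0042]   # "A", U+20000 as a pair, "B"
chars = list(code_points(units))
print(len(units), len(chars))              # 4 3
```

The length of the unit sequence and the number of characters differ exactly when a surrogate pair is present, which is the distinction the thread is arguing about.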

You don't need to switch to UCS-4 internally to implement what I'm

> I'd say that ideally the choice to have a 2 or 4 byte internal
> representation (or no Unicode support at all, for some platforms like
> PalmOS!) should be a configuration choice.

I don't think it should be a configuration choice. That leads to
incompatibilities between people's scripts. It's bad enough already
with some things working with threaded versions of Python and some not
(e.g., Zope requires threading, but mod_python doesn't work if it's
turned on).

BTW, Palm recently joined the Unicode Consortium, and Symbian has
Unicode support.

> Right now the implementation doesn't allow that choice at all, which
> should be remedied -- maybe you can help by submitting patches?


--

Tom Emerson                                          Basis Technology Cor=
Sr. Sinostringologist                              http://www.basistech.c=
  "Beware the lollipop of mediocrity: lick it once and you suck forever"