[I18n-sig] How does Python Unicode treat surrogates?

Tom Emerson tree@basistech.com
Mon, 25 Jun 2001 10:36:10 -0400


Guido van Rossum writes:
> Depends on what you call transparent.  I'm all for smart codecs
> between UTF-16 and UTF-8, but if you have a surrogate in a Unicode
> string, the application will have to know not to split it in the
> middle, and it must realize that len(u) is not necessarily the number
> of characters -- it's the number of 16-bit units in the UTF-16
> encoding.

Surrogates were created as a way to allow characters outside Plane 0
(the BMP) to be accessed within a sixteen-bit codespace. When using
UTF-16, a character consists of either two octets or four octets. A
character that cannot be represented within the 16-bit code space is
encoded using a surrogate pair, but it is the same character
regardless.
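
To make the mapping concrete, here is a minimal sketch of the
surrogate-pair arithmetic in Python. The helper name
to_surrogate_pair is hypothetical, made up for illustration, and is
not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a code point outside the BMP into (high, low) UTF-16 surrogates.

    Hypothetical helper, for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as surrogate pairs")
    offset = code_point - 0x10000       # 20-bit value to be split in two
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x10400)
print("%04X %04X" % (high, low))        # prints "D801 DC00"
```

The two 16-bit values are only an encoding detail: together they
still denote the single character U+10400.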

So, for example, the ideograph at U+20000 is the same character
whether it is encoded as <20000> (UCS-4, UTF-32), <D840 DC00>
(UTF-16), or <F0 A0 80 80> (UTF-8). It doesn't matter what
transformation format you use: it's the *same* character.
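
The byte sequences above can be checked with a short sketch. It
assumes a Python 3 interpreter and uses only the standard codecs; the
variable names are mine.

```python
ch = "\U00020000"                    # the ideograph at U+20000

# Big-endian codec names are used so that no byte order mark is emitted.
utf32 = ch.encode("utf-32-be")
utf16 = ch.encode("utf-16-be")
utf8 = ch.encode("utf-8")

for name, data in (("UTF-32", utf32), ("UTF-16", utf16), ("UTF-8", utf8)):
    print(name, data.hex())

# Decoding each form gives back the same one-character string.
decoded = {
    utf32.decode("utf-32-be"),
    utf16.decode("utf-16-be"),
    utf8.decode("utf-8"),
}
print(decoded == {ch})
```

Three different byte sequences, one character.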

Hence, when I have a Unicode string, I'm thinking of each character as a
Unicode character, not as a sequence of UTF-16 or UCS-2 two-octet
words.
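
The difference between the two views can be sketched as follows. This
assumes Python 3.3 or later, where len() on a str counts code points;
on an implementation that stores strings as UTF-16 code units, len()
would report the second number instead.

```python
s = "a\U00020000b"                   # three characters, one outside the BMP

# Number of Unicode characters (code points), under the assumption above.
code_points = len(s)

# Number of 16-bit units the same string occupies when encoded as UTF-16.
utf16_units = len(s.encode("utf-16-le")) // 2

print(code_points, utf16_units)      # prints "3 4"
```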

Hence my belief that Unicode strings should not be synonymous with
whatever underlying physical character representation is used.

Clear as mud? :-)

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"