[Python-Dev] utf-8 issue thread question

Fredrik Lundh fredrik@pythonware.com
Wed, 11 Sep 2002 02:24:53 +0200


Brett Cannon wrote:

> The following is my current rough summary explanation for what a surrogate
> is.  Can someone please correct it as needed?

needed, indeed.

it's 2.30 am over here, so I'm not going to try to explain this myself,
but some random googling brought up this page:

http://216.239.37.100/search?q=cache:Dk12BZNt6skC:uk.geocities.com/BabelStone1357/Software/surrogates.html

    The code points U+D800 through U+DB7F are reserved as High Surrogates,
    and the code points U+DC00 through U+DFFF are reserved as Low Surrogates.
    Each code point in [the full 20-bit unicode character space] maps to a pair of
    16-bit code points comprising a High Surrogate followed by a Low Surrogate.
    Thus, for example, the Gothic letter AHSA has the UTF-32 value of U+10330,
    which maps to the surrogate pair U+D800 and U+DF30. That is to say, in the
    16-bit encoding of Unicode (UTF-16), the Gothic letter AHSA is represented
    by two consecutive 16-bit code points (U+D800 and U+DF30), whereas in the
    32-bit encoding of Unicode (UTF-32), the same letter is represented by a
    single 32-bit value (U+10330).
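
for the record, the mapping described in that passage can be sketched in a
few lines of Python. this is only an illustration, not anything from the
quoted page: the function names to_surrogate_pair and from_surrogate_pair
are made up here, and the high range is taken as U+D800 through U+DBFF,
which also covers the private-use high surrogates (U+DB80 through U+DBFF)
that the passage leaves out.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits select the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Combine a high/low surrogate pair back into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Gothic letter AHSA, the example used in the quoted passage
high, low = to_surrogate_pair(0x10330)
print("U+%04X U+%04X" % (high, low))
print("U+%04X" % from_surrogate_pair(high, low))
```

run on the AHSA example, this should give back the same pair the page
mentions (U+D800 and U+DF30), and the reverse function should return
U+10330.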

</F>