[Python-Dev] utf-8 issue thread question
Fredrik Lundh
fredrik@pythonware.com
Wed, 11 Sep 2002 02:24:53 +0200
Brett Cannon wrote:
> The following is my current rough summary explanation for what a surrogate
> is. Can someone please correct it as needed?
needed, indeed.
it's 2.30 am over here, so I'm not going to try to explain this myself,
but some random googling brought up this page:
http://216.239.37.100/search?q=cache:Dk12BZNt6skC:uk.geocities.com/BabelStone1357/Software/surrogates.html
The code points U+D800 through U+DB7F are reserved as High Surrogates,
and the code points U+DC00 through U+DFFF are reserved as Low Surrogates.
Each code point in [the full 20-bit unicode character space] maps to a pair of
16-bit code points comprising a High Surrogate followed by a Low Surrogate.
Thus, for example, the Gothic letter AHSA has the UTF-32 value of U+10330,
which maps to the surrogate pair U+D800 and U+DF30. That is to say, in the
16-bit encoding of Unicode (UTF-16), the Gothic letter AHSA is represented
by two consecutive 16-bit code points (U+D800 and U+DF30), whereas in the
32-bit encoding of Unicode (UTF-32), the same letter is represented by a
single 32-bit value (U+10330).
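
for the arithmetic behind that example, here is a minimal sketch of the
mapping. the function names are made up for illustration and are not part
of any Python API. it assumes the standard UTF-16 rule: subtract 0x10000,
then split the remaining 20 bits into two 10-bit halves. note that the
range check on the first half of the pair uses U+D800 through U+DBFF,
which also covers the private-use high surrogates (U+DB80 through U+DBFF)
that the quoted passage does not mention.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Gothic letter AHSA, the example from the quoted page
high, low = to_surrogate_pair(0x10330)
print("U+%04X U+%04X" % (high, low))     # U+D800 U+DF30
```

running it on U+10330 gives the same pair as the quoted page, and
from_surrogate_pair(0xD800, 0xDF30) gets you back to 0x10330.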
</F>