[Python-Dev] utf-8 issue thread question

Brett Cannon drifty@bigfoot.com
Tue, 10 Sep 2002 17:40:00 -0700 (PDT)

[Fredrik Lundh]

> Brett Cannon wrote:
> it's 2.30 am over here, so I'm not going to try to explain this myself,
> but some random googling brought up this page:
>     The code points U+D800 through U+DB7F are reserved as High Surrogates,
>     and the code points U+DC00 through U+DFFF are reserved as Low Surrogates.
>     Each code point in [the full 20-bit unicode character space] maps to a pair of
>     16-bit code points comprising a High Surrogate followed by a Low Surrogate.
>     Thus, for example, the Gothic letter AHSA has the UTF-32 value of U+10330,
>     which maps to the surrogate pair U+D800 and U+DF30. That is to say, in the
>     16-bit encoding of Unicode (UTF-16), the Gothic letter AHSA is represented
>     by two consecutive 16-bit code points (U+D800 and U+DF30), whereas in the
>     32-bit encoding of Unicode (UTF-32), the same letter is represented by a
>     single 32-bit value (U+10330).
> </F>

So with that explanation, here is the current rewrite:

In Unicode, a surrogate pair is when you create the representation of a
character by using two values. So, for instance, UTF-32 can cover the
entire Unicode space (Unicode is 20 bits), but UTF-16 can't.  To solve the
issue a character can be represented as a pair of UTF-16 values.

The problem in Python 2.2.1 is that when there is only a lone surrogate
(instead of there being a pair of values), the encoder for UTF-8 messes up
and leaves off a UTF-8 value.  The following line is an example:

	>>> u'\ud800'.encode('utf-8')
	'\xa0\x80'  #In Python 2.2.1
	'\xed\xa0\x80'  #In Python 2.3a0

Notice how in Python 2.3a0 the extra value is inserted so as to make the
representation a complete Unicode character instead of only encoding the
half of the surrogate pair that the encode was given.

How is that?