[Python-Dev] utf-8 issue thread question
Brett Cannon
drifty@bigfoot.com
Tue, 10 Sep 2002 17:40:00 -0700 (PDT)
[Fredrik Lundh]
> Brett Cannon wrote:
>
> it's 2.30 am over here, so I'm not going to try to explain this myself,
> but some random googling brought up this page:
>
> http://216.239.37.100/search?q=cache:Dk12BZNt6skC:uk.geocities.com/BabelStone1357/Software/surrogates.html
>
> The code points U+D800 through U+DB7F are reserved as High Surrogates,
> and the code points U+DC00 through U+DFFF are reserved as Low Surrogates.
> Each code point in [the full 20-bit unicode character space] maps to a pair of
> 16-bit code points comprising a High Surrogate followed by a Low Surrogate.
> Thus, for example, the Gothic letter AHSA has the UTF-32 value of U+10330,
> which maps to the surrogate pair U+D800 and U+DF30. That is to say, in the
> 16-bit encoding of Unicode (UTF-16), the Gothic letter AHSA is represented
> by two consecutive 16-bit code points (U+D800 and U+DF30), whereas in the
> 32-bit encoding of Unicode (UTF-32), the same letter is represented by a
> single 32-bit value (U+10330).
>
> </F>
>
So with that explanation, here is the current rewrite:
"""
In Unicode, a surrogate pair is when you create the representation of a
character by using two values. So, for instance, UTF-32 can cover the
entire Unicode space (Unicode is 20 bits), but UTF-16 can't. To solve the
issue a character can be represented as a pair of UTF-16 values.
The problem in Python 2.2.1 is that when there is only a lone surrogate
(instead of there being a pair of values), the encoder for UTF-8 messes up
and leaves off a UTF-8 value. The following line is an example:
>>> u'\ud800'.encode('utf-8')
'\xa0\x80' #In Python 2.2.1
'\xed\xa0\x80' #In Python 2.3a0
Notice how in Python 2.3a0 the extra value is inserted so as to make the
representation a complete Unicode character instead of only encoding the
half of the surrogate pair that the encode was given.
"""
How is that?
-Brett