
[Fredrik Lundh]
Brett Cannon wrote:
it's 2.30 am over here, so I'm not going to try to explain this myself, but some random googling brought up this page:
http://216.239.37.100/search?q=cache:Dk12BZNt6skC:uk.geocities.com/BabelSton...
The code points U+D800 through U+DB7F are reserved as High Surrogates, and the code points U+DC00 through U+DFFF are reserved as Low Surrogates. Each code point in [the full 20-bit unicode character space] maps to a pair of 16-bit code points comprising a High Surrogate followed by a Low Surrogate. Thus, for example, the Gothic letter AHSA has the UTF-32 value of U+10330, which maps to the surrogate pair U+D800 and U+DF30. That is to say, in the 16-bit encoding of Unicode (UTF-16), the Gothic letter AHSA is represented by two consecutive 16-bit code points (U+D800 and U+DF30), whereas in the 32-bit encoding of Unicode (UTF-32), the same letter is represented by a single 32-bit value (U+10330).
</F>
So with that explanation, here is the current rewrite: """ In Unicode, a surrogate pair is when you create the representation of a character by using two values. So, for instance, UTF-32 can cover the entire Unicode space (Unicode is 20 bits), but UTF-16 can't. To solve the issue a character can be represented as a pair of UTF-16 values. The problem in Python 2.2.1 is that when there is only a lone surrogate (instead of there being a pair of values), the encoder for UTF-8 messes up and leaves off a UTF-8 value. The following line is an example: >>> u'\ud800'.encode('utf-8') '\xa0\x80' #In Python 2.2.1 '\xed\xa0\x80' #In Python 2.3a0 Notice how in Python 2.3a0 the extra value is inserted so as to make the representation a complete Unicode character instead of only encoding the half of the surrogate pair that the encode was given. """ How is that? -Brett