
Brett Cannon wrote:
The following is my current rough summary explanation for what a surrogate is. Can someone please correct it as needed?
needed, indeed. it's 2.30 am over here, so I'm not going to try to explain this myself, but some random googling brought up this page:

http://216.239.37.100/search?q=cache:Dk12BZNt6skC:uk.geocities.com/BabelSton...

The code points U+D800 through U+DBFF are reserved as High Surrogates, and the code points U+DC00 through U+DFFF are reserved as Low Surrogates. Each code point in [the 20-bit supplementary range, U+10000 through U+10FFFF] maps to a pair of 16-bit code units comprising a High Surrogate followed by a Low Surrogate.

Thus, for example, the Gothic letter AHSA has the code point U+10330, which maps to the surrogate pair U+D800 and U+DF30. That is to say, in the 16-bit encoding of Unicode (UTF-16), the Gothic letter AHSA is represented by two consecutive 16-bit code units (U+D800 and U+DF30), whereas in the 32-bit encoding of Unicode (UTF-32), the same letter is represented by a single 32-bit value (U+10330).

</F>
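
For anyone who wants to check the AHSA example by hand, the mapping can be sketched in a few lines of Python. This is only a minimal sketch of the standard UTF-16 arithmetic (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves); the function names to_surrogate_pair and from_surrogate_pair are made up for illustration and are not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10330)     # Gothic letter AHSA
print(f"U+{high:04X} U+{low:04X}")         # prints "U+D800 U+DF30"
```

The same result should fall out of Python's own codec: encoding the character as big-endian UTF-16 yields the two 16-bit units D800 and DF30 back to back.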