
So here is the summary question for this thread: what exactly is a surrogate?
Unicode surrogates are used specifically to encode Unicode characters with values >= 2**16 as two 16-bit values. The Unicode standard has conveniently reserved two ranges for these (see /F's post): 0xD800-0xDBFF for high surrogates and 0xDC00-0xDFFF for low surrogates. The first (high) surrogate encodes the high 10 bits, the second (low) surrogate encodes the low 10 bits. For redundancy, the top bit pattern is different for high and low surrogates. One thing to watch out for: I believe that the bit pattern that's encoded is not the bit pattern of the full Unicode character, but that value minus 2**16. This allows one to encode 2**16 more characters, at the cost of some extra complexity.
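To make the arithmetic concrete, here is a minimal sketch in Python 3. The function names to_surrogates and from_surrogates are made up for illustration; they are not part of any Python API.

```python
def to_surrogates(code_point):
    """Split a code point >= 2**16 into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 0x10000..0x10FFFF need surrogates")
    offset = code_point - 0x10000      # 20-bit value: the character minus 2**16
    high = 0xD800 | (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 | (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))               # 0xd83d 0xde00
print(hex(from_surrogates(high, low)))   # 0x1f600
```

The result can be cross-checked against Python's own UTF-16 codec: "\U0001F600".encode("utf-16-be") yields the same two 16-bit values.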
I think I get it (from reading an i18n email from MAL on the i18n list), but I am not confident enough to stick it in the summary as of yet.
The following is my current rough summary explanation for what a surrogate is. Can someone please correct it as needed?
""" In Unicode, a surrogate is when you encode from a higher bit total encoding (such as utf-16) into a smaller bit total encoding by representing the character as several more bit chunks (such as two utf-8 chunks). The following line is an example:
u'\ud800'.encode('utf-8') == '\xed\xa0\x80'
Notice how the initial Unicode character ends up being encoded as three characters in utf-8. """
No, the UTF-8 encoding is not called a surrogate. Only 16-bit values are surrogates. In this example, \ud800 is a high surrogate that's not followed by a low surrogate. The UTF-8 encoder could do two things with this: encode the bit pattern, or throw an error. Note that when the UTF-8 encoder sees a *pair* of surrogates (a high surrogate followed by a low surrogate), it is supposed to extract the single Unicode character from them, and encode that. The UTF-8 decoder must in turn create a surrogate pair when decoding to 16-bit Unicode (as opposed to when decoding to 32-bit Unicode, when it should not generate surrogates). Note that there are various problems with this. Surrogates are illegal in 32-bit Unicode, but of course you cannot really prevent them from occurring. What should it mean when one does occur?
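For anyone who wants to poke at these cases, here is a rough sketch using the codecs in Python 3 (3.4 or later is assumed, for "surrogatepass" on the UTF-16 codecs). The behaviour differs from the Python 2 example quoted above: Python 3 takes the "throw an error" option by default, and the "surrogatepass" error handler gives the "encode the bit pattern" option.

```python
lone_high = "\ud800"  # a high surrogate with no low surrogate after it

# Option 1: throw an error (the default "strict" handler in Python 3).
try:
    lone_high.encode("utf-8")
    strict_result = "encoded"
except UnicodeEncodeError:
    strict_result = "error"

# Option 2: encode the bit pattern of the lone surrogate anyway.
passed_through = lone_high.encode("utf-8", "surrogatepass")

# A proper pair: round-trip through UTF-16 so the two surrogates are
# combined into the single character they stand for, then encode that.
pair = "\ud83d\ude00"
combined = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
utf8_bytes = combined.encode("utf-8")

# Decoding the UTF-8 bytes and re-encoding shows both targets:
# 16-bit Unicode gets a surrogate pair, 32-bit Unicode gets one value.
char = utf8_bytes.decode("utf-8")
utf16_units = char.encode("utf-16-be")
utf32_units = char.encode("utf-32-be")

print(strict_result)     # error
print(passed_through)    # b'\xed\xa0\x80'
print(utf8_bytes)        # b'\xf0\x9f\x98\x80'
print(utf16_units.hex()) # d83dde00
print(utf32_units.hex()) # 0001f600
```

Note that the pair only becomes one character after the explicit UTF-16 round trip; Python 3's UTF-8 encoder does not combine surrogates that are already sitting in a str.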
Also, does anyone know of some good Unicode tutorials, explanations, etc. on the web, in book form, whatever? Most of the threads that I don't totally comprehend are Unicode related and I would like to keep my brain-dead questions to a minimum. Don't want my reputation to go down the drain. =)
I think the Unicode Consortium website, www.unicode.org, has lots of good stuff, including the complete standard online.
--Guido van Rossum (home page: http://www.python.org/~guido/)