utf-8 issue thread question

So here is the summary question for this thread: what exactly is a surrogate? I think I get it (from reading an i18n email from MAL on the i18n list), but I am not confident enough to stick it in the summary as of yet.

The following is my current rough summary explanation for what a surrogate is. Can someone please correct it as needed?

"""
In Unicode, a surrogate is when you encode from a higher bit total encoding (such as utf-16) into a smaller bit total encoding by representing the character as several more bit chunks (such as two utf-8 chunks). The following line is an example:

>>> u'\ud800'.encode('utf-8') == '\xed\xa0\x80'

Notice how the initial Unicode character ends up being encoded as three characters in utf-8.
"""

Also, anyone know of some good Unicode tutorials, explanations, etc. on the web, in book form, whatever? Most of the threads that I don't totally comprehend are Unicode related, and I would like to keep my brain-dead questions to a minimum. Don't want my reputation to go down the drain. =)

-Brett

Brett Cannon wrote:
The following is my current rough summary explanation for what a surrogate is. Can someone please correct it as needed?
needed, indeed. it's 2.30 am over here, so I'm not going to try to explain this myself, but some random googling brought up this page:

http://216.239.37.100/search?q=cache:Dk12BZNt6skC:uk.geocities.com/BabelSton...

    The code points U+D800 through U+DB7F are reserved as High Surrogates, and
    the code points U+DC00 through U+DFFF are reserved as Low Surrogates. Each
    code point in [the full 20-bit unicode character space] maps to a pair of
    16-bit code points comprising a High Surrogate followed by a Low Surrogate.
    Thus, for example, the Gothic letter AHSA has the UTF-32 value of U+10330,
    which maps to the surrogate pair U+D800 and U+DF30. That is to say, in the
    16-bit encoding of Unicode (UTF-16), the Gothic letter AHSA is represented
    by two consecutive 16-bit code points (U+D800 and U+DF30), whereas in the
    32-bit encoding of Unicode (UTF-32), the same letter is represented by a
    single 32-bit value (U+10330).

</F>

[Fredrik Lundh]
Brett Cannon wrote:
it's 2.30 am over here, so I'm not going to try to explain this myself, but some random googling brought up this page:
http://216.239.37.100/search?q=cache:Dk12BZNt6skC:uk.geocities.com/BabelSton...
The code points U+D800 through U+DB7F are reserved as High Surrogates, and the code points U+DC00 through U+DFFF are reserved as Low Surrogates. Each code point in [the full 20-bit unicode character space] maps to a pair of 16-bit code points comprising a High Surrogate followed by a Low Surrogate. Thus, for example, the Gothic letter AHSA has the UTF-32 value of U+10330, which maps to the surrogate pair U+D800 and U+DF30. That is to say, in the 16-bit encoding of Unicode (UTF-16), the Gothic letter AHSA is represented by two consecutive 16-bit code points (U+D800 and U+DF30), whereas in the 32-bit encoding of Unicode (UTF-32), the same letter is represented by a single 32-bit value (U+10330).
</F>
So with that explanation, here is the current rewrite:

"""
In Unicode, a surrogate pair is when you create the representation of a character by using two values. So, for instance, UTF-32 can cover the entire Unicode space (Unicode is 20 bits), but UTF-16 can't. To solve the issue, a character can be represented as a pair of UTF-16 values.

The problem in Python 2.2.1 is that when there is only a lone surrogate (instead of a pair of values), the encoder for UTF-8 messes up and leaves off a UTF-8 byte. The following line is an example:

>>> u'\ud800'.encode('utf-8')
'\xa0\x80'      # In Python 2.2.1
'\xed\xa0\x80'  # In Python 2.3a0

Notice how in Python 2.3a0 the extra byte is inserted so as to make the representation a complete UTF-8 sequence instead of only encoding the half of the surrogate pair that the encoder was given.
"""

How is that?

-Brett
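[Editor's note: for readers trying the example above on a modern interpreter, the picture has since changed. In Python 3, str is already Unicode and the strict UTF-8 encoder refuses a lone surrogate outright; the 'surrogatepass' error handler is needed to reproduce the raw three-byte pattern quoted above. A quick sketch:

```python
# Python 3: a lone surrogate is rejected by the strict UTF-8 encoder...
try:
    '\ud800'.encode('utf-8')
except UnicodeEncodeError as exc:
    print('strict encoder refused it:', exc.reason)

# ...but the 'surrogatepass' error handler emits the raw three-byte
# pattern, matching the 2.3a0 output quoted in the mail above.
encoded = '\ud800'.encode('utf-8', 'surrogatepass')
print(encoded)  # b'\xed\xa0\x80'
```

So the 2.3a0 behavior discussed in this thread (encode the bit pattern) and the throw-an-error alternative both survived; modern Python just made the error the default.]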

So here is the summary question for this thread: what exactly is a surrogate?
Unicode surrogates are used specifically to encode Unicode characters with values >= 2**16 as two 16-bit code points. The Unicode standard has conveniently reserved two ranges for these (see /F's post). The first (high) surrogate encodes the high 10 bits, the second (low) surrogate encodes the low 10 bits. For redundancy, the top bit pattern is different for high and low surrogates. One thing to watch out for: I believe that the bit pattern that's encoded is not the bit pattern of the full unicode character, but 2**16 less. This allows one to encode 2**16 more characters, at the cost of some extra complexity.
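[Editor's note: the bit arithmetic described above can be sketched in a few lines of Python; the helper name is illustrative, not from any library:

```python
def to_surrogate_pair(cp):
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF, "only code points >= 2**16 need surrogates"
    v = cp - 0x10000               # subtract 2**16, as described above
    high = 0xD800 | (v >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)     # low 10 bits -> low surrogate
    return high, low

# Fredrik's example: Gothic letter AHSA, U+10330
print([hex(u) for u in to_surrogate_pair(0x10330)])  # ['0xd800', '0xdf30']
```

Note how the distinct top bit patterns (0xD800 vs 0xDC00) make a high surrogate and a low surrogate distinguishable on sight, which is the redundancy mentioned above.]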
I think I get it (from reading an i18n email from MAL on the i18n list), but I am not confident enough to stick it in the summary as of yet.
The following is my current rough summary explanation for what a surrogate is. Can someone please correct it as needed?
""" In Unicode, a surrogate is when you encode from a higher bit total encoding (such as utf-16) into a smaller bit total encoding by representing the character as several more bit chunks (such as two utf-8 chunks). The following line is an example:
u'\ud800'.encode('utf-8') == '\xed\xa0\x80'
Notice how the initial Unicode character ends up being encoded as three characters in utf-8. """
No, the UTF-8 encoding is not called a surrogate. Only 16-bit values are surrogates. In this example, \ud800 is a high surrogate that's not followed by a low surrogate. The UTF-8 encoder could do two things with this: encode the bit pattern, or throw an error.

Note that when the UTF-8 encoder sees a *pair* of surrogates (a high surrogate followed by a low surrogate), it is supposed to extract the single unicode character from them, and encode that. The UTF-8 decoder must in turn create a surrogate pair when decoding to 16-bit Unicode (as opposed to when decoding to 32-bit Unicode, when it should not generate surrogates).

Note that there are various problems with this. Surrogates are illegal in 32-bit Unicode, but of course you cannot really prevent them from occurring. What should that mean?
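[Editor's note: the pairing step the decoder performs when targeting 16-bit Unicode is just the inverse bit arithmetic; a minimal sketch, with an illustrative helper name:

```python
def from_surrogate_pair(high, low):
    """Combine a UTF-16 surrogate pair back into one code point."""
    assert 0xD800 <= high <= 0xDBFF, "not a high surrogate"
    assert 0xDC00 <= low <= 0xDFFF, "not a low surrogate"
    # Re-attach the 2**16 offset and glue the two 10-bit halves together.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(from_surrogate_pair(0xD800, 0xDF30)))  # 0x10330, Gothic letter AHSA
```

The assertions make the "various problems" concrete: a high surrogate not followed by a low one, like the \ud800 in Brett's example, simply has no code point to map back to.]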
Also, anyone know of some good Unicode tutorials, explanations, etc. on the web, in book form, whatever? Most of the threads that I don't totally comprehend are Unicode related and I would like to minimize my brain-dead questions to a minimum. Don't want my reputation to go down the drain. =)
I think the Unicode consortium website, www.unicode.org, has lots of good stuff, including the complete standard online.

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum <guido@python.org> writes:
One thing to watch out for: I believe that the bit pattern that's encoded is not the bit pattern of the full unicode character, but 2**16 less. This allows one to encode 2**16 more characters, at the cost of some extra complexity.
Correct. That allows one to encode a total of 17 planes in Unicode, a plane being 2**16 characters. Therefore, saying that Unicode is 20 bits is somewhat imprecise - it's better to say that it is 21 bits.

Regards,
Martin
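[Editor's note: the plane arithmetic above checks out directly:

```python
# 17 planes of 2**16 code points each: the BMP plus 16 supplementary planes.
max_code_point = 17 * 2**16 - 1
print(hex(max_code_point))  # 0x10ffff, the top of the Unicode code space

# 20 bits only reach 0xFFFFF; the full range needs 21 bits.
print(max_code_point.bit_length())  # 21
```

The 16 supplementary planes are exactly the 2**20 code points reachable through surrogate pairs, sitting on top of the 2**16 code points of the BMP.]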
participants (4)
- Brett Cannon
- Fredrik Lundh
- Guido van Rossum
- martin@v.loewis.de