[Python-Dev] utf-8 issue thread question
Brett Cannon
drifty@bigfoot.com
Tue, 10 Sep 2002 17:07:58 -0700 (PDT)
So here is the summary question for this thread: what exactly is a
surrogate? I think I get it (from reading an i18n email from MAL on the
i18n list), but I am not confident enough to stick it in the summary as of
yet.
The following is my current rough summary explanation for what a surrogate
is. Can someone please correct it as needed?
"""
In Unicode, a surrogate is what you get when you encode from an encoding
with more bits per unit (such as utf-16) into one with fewer bits per unit
by representing the character as several smaller chunks (such as several
utf-8 bytes). The following line is an example:
>>> u'\ud800'.encode('utf-8') == '\xed\xa0\x80'
Notice how the initial Unicode character ends up being encoded as three
bytes in utf-8.
"""
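For comparison with the draft above, here is a minimal sketch of how the
Unicode standard describes surrogates: code points in the range U+D800 to
U+DFFF that UTF-16 uses in pairs (a high surrogate followed by a low one)
to represent a single character above U+FFFF. The sketch assumes Python 3
rather than the Python 2 session quoted above, and to_surrogate_pair is a
hypothetical helper name, not a standard-library function.

```python
# Sketch (Python 3 assumed): split a code point above U+FFFF into its
# UTF-16 surrogate pair. to_surrogate_pair is a hypothetical helper name.

def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000      # 20-bit value left after the shift
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogate_pair(0x10400)

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = chr(0x10400).encode("utf-16-be")
pair_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")

# A lone surrogate such as U+D800 is not a character by itself. Python 3
# refuses to encode it to UTF-8 unless 'surrogatepass' is requested.
lone = "\ud800".encode("utf-8", "surrogatepass")
```

Under these assumptions, U+10400 splits into the pair D801 DC00, which
matches what the utf-16-be codec produces. The lone surrogate from the
draft yields the same three bytes only because 'surrogatepass' is given.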
Also, does anyone know of some good Unicode tutorials, explanations, etc.
on the web, in book form, whatever? Most of the threads that I don't
totally comprehend are Unicode-related and I would like to keep my
brain-dead questions to a minimum. Don't want my reputation to go down the
drain.
=)
-Brett