[Python-Dev] utf-8 issue thread question

Brett Cannon drifty@bigfoot.com
Tue, 10 Sep 2002 17:07:58 -0700 (PDT)

So here is the summary question for this thread: what exactly is a
surrogate?  I think I get it (from reading an i18n email from MAL on the
i18n list), but I am not confident enough to stick it in the summary as of
yet.

The following is my current rough summary explanation for what a surrogate
is.  Can someone please correct it as needed?

In Unicode, a surrogate is when you encode from a higher bit total
encoding (such as utf-16) into a smaller bit total encoding by
representing the character as several more bit chunks (such as two utf-8
chunks).  The following line is an example:

	>>> u'\ud800'.encode('utf-8') == '\xed\xa0\x80'

Notice how the single Unicode character ends up being encoded as three
bytes in utf-8.
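
As a point of comparison for the draft above, here is a minimal sketch,
assuming a Python 3 interpreter rather than the 2.x one used in the example
(all variable names are illustrative only).  It shows the case the Unicode
standard defines surrogates for: a code point above U+FFFF is written in
utf-16 as a pair of 16-bit code units, one from D800-DBFF and one from
DC00-DFFF.  It also shows that Python 3 refuses to encode a lone surrogate
to utf-8 unless the "surrogatepass" error handler is requested.

```python
# Assumes Python 3; the names below are illustrative, not from the thread.
ch = "\U00010000"  # first code point above U+FFFF

# utf-16 represents this one character as two 16-bit code units.
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
print([hex(u) for u in units])  # ['0xd800', '0xdc00']

# utf-8 encodes the code point directly as four bytes, with no surrogates.
utf8 = ch.encode("utf-8")
print(utf8)  # b'\xf0\x90\x80\x80'

# A lone surrogate is not a valid character, so strict utf-8 encoding fails.
lone = "\ud800"
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The "surrogatepass" handler reproduces the three bytes from the example.
forced = lone.encode("utf-8", "surrogatepass")
print(strict_ok, forced)  # False b'\xed\xa0\x80'
```

If this reading is right, the three bytes in the example above are just the
ordinary utf-8 pattern for any code point in the U+0800-U+FFFF range applied
to half of a surrogate pair, not a surrogate mechanism in utf-8 itself.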

Also, does anyone know of some good Unicode tutorials, explanations, etc. on
the web, in book form, whatever?  Most of the threads that I don't totally
comprehend are Unicode related, and I would like to keep my brain-dead
questions to a minimum.  Don't want my reputation to go down the drain.