[Python-Dev] utf-8 issue thread question

Guido van Rossum guido@python.org
Tue, 10 Sep 2002 20:39:14 -0400


> So here is the summary question for this thread: what exactly is a
> surrogate?

Unicode surrogates are used specifically to encode Unicode characters
with values >= 2**16 as two 16-bit code units.  The Unicode standard
has conveniently reserved two ranges for these (see /F's post).  The
first (high) surrogate encodes the high 10 bits, the second (low)
surrogate encodes the low 10 bits.  For redundancy, the top bit
pattern is different for high and low surrogates.  One thing to watch
out for: I believe that the bit pattern that's encoded is not the bit
pattern of the full Unicode character, but that value minus 2**16.
This allows one to encode 2**16 more characters, at the cost of some
extra complexity.
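
The arithmetic described above can be sketched roughly as follows.  This
is only an illustration: the function names to_surrogate_pair and
from_surrogate_pair are made up for this example, and the range
constants (0xD800-0xDBFF for high surrogates, 0xDC00-0xDFFF for low
surrogates) are the ones reserved by the Unicode standard.

```python
def to_surrogate_pair(code_point):
    """Split a code point >= 2**16 into (high, low) 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 0x10000..0x10FFFF need surrogates")
    offset = code_point - 0x10000        # the "2**16 less" step; 20 bits remain
    high = 0xD800 | (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 | (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("expected a high surrogate followed by a low surrogate")
    return 0x10000 + (((high - 0xD800) << 10) | (low - 0xDC00))
```

For instance, to_surrogate_pair(0x1D11E) gives (0xD834, 0xDD1E), and
passing that pair to from_surrogate_pair gives 0x1D11E back.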

> I think I get it (from reading an i18n email from MAL on the
> i18n list), but I am not confident enough to stick in the summary as of
> yet.
> 
> The following is my current rough summary explanation for what a surrogate
> is.  Can someone please correct it as needed?
> 
> """
> In Unicode, a surrogate is when you encode from a higher bit total
> encoding (such as utf-16) into a smaller bit total encoding by
> representing the character as several more bit chunks (such as two utf-8
> chunks).  The following line is an example:
> 
> 	>>> u'\ud800'.encode('utf-8') == '\xed\xa0\x80'
> 
> Notice how the initial Unicode character ends up being encoded as three
> characters in utf-8.
> """

No, the UTF-8 encoding is not called a surrogate.  Only 16-bit values
are surrogates.  In this example, \ud800 is a high surrogate that's not
followed by a low surrogate.  The UTF-8 encoder could do two things
with this: encode the bit pattern, or throw an error.  Note that when
the UTF-8 encoder sees a *pair* of surrogates (a high surrogate
followed by a low surrogate), it is supposed to extract the single
Unicode character from them, and encode that.  The UTF-8 decoder must
in turn create a surrogate pair when decoding to 16-bit Unicode (as
opposed to when decoding to 32-bit Unicode, when it should not
generate surrogates).
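
The two choices for a lone surrogate, and the handling of a proper
pair, can be illustrated with a small sketch.  Note the assumptions: it
is written for Python 3, whose str type, strict UTF-8 codec and
'surrogatepass' error handler all postdate this message, so it shows
one possible behaviour rather than what the encoder under discussion
does.  The variable names are made up for the example.

```python
lone_high = '\ud800'  # a high surrogate with no low surrogate after it

# Choice 1: throw an error (what Python 3's strict UTF-8 codec does).
try:
    lone_high.encode('utf-8')
    strict_result = 'encoded'
except UnicodeEncodeError:
    strict_result = 'error'

# Choice 2: encode the bit pattern anyway, via the 'surrogatepass' handler.
raw_bytes = lone_high.encode('utf-8', 'surrogatepass')

# A high surrogate followed by a low surrogate stands for one character.
# A round trip through UTF-16 combines the pair into that character,
# which can then be encoded to UTF-8 as a single four-byte sequence.
pair = '\ud834\udd1e'
combined = pair.encode('utf-16-le', 'surrogatepass').decode('utf-16-le')
utf8_bytes = combined.encode('utf-8')
```

Here raw_bytes matches the three bytes from the example in the
question, while utf8_bytes holds the encoding of the single character
U+1D11E rather than two three-byte sequences.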

Note that there are various problems with this.  Surrogates are
illegal in 32-bit Unicode, but of course you cannot really prevent
them from occurring.  What should it mean when they do occur?

> Also, anyone know of some good Unicode tutorials, explanations,
> etc. on the web, in book form, whatever?  Most of the threads that I
> don't totally comprehend are Unicode related and I would like to
> keep my brain-dead questions to a minimum.  Don't want my
> reputation to go down the drain.  =)

I think the Unicode Consortium website, www.unicode.org, has lots of
good stuff, including the complete standard online.

--Guido van Rossum (home page: http://www.python.org/~guido/)