[I18n-sig] How does Python Unicode treat surrogates?

Tom Emerson tree@basistech.com
Mon, 25 Jun 2001 13:43:23 -0400

Guido van Rossum writes:
> To extract the n'th Unicode character you would have to loop over all
> the preceding characters checking for surrogates.  This makes it O(n).

No. If the n'th character is a valid high-surrogate (U+D800 -- U+DBFF)
then look at the n+1'th character for a valid low-surrogate (U+DC00 --
U+DFFF). If the n'th character is a valid low-surrogate and the n-1'th
character is a valid high-surrogate, then skip it: it is the second
half of a pair that was already handled at position n-1.
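
A minimal sketch of that check, assuming the text is held as a sequence
of 16-bit code units. The helper names utf16_units and code_point_at are
made up for illustration and are not part of Python. Note that n here
indexes code units, so the lookup only ever touches positions n-1, n,
and n+1.

```python
HIGH_FIRST, HIGH_LAST = 0xD800, 0xDBFF   # high (leading) surrogates
LOW_FIRST, LOW_LAST = 0xDC00, 0xDFFF     # low (trailing) surrogates


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def code_point_at(units, n):
    """Look at code unit n and, at most, its two neighbours.

    Returns the code point that starts at index n, or None when index n
    is the low half of a valid pair and should be skipped. An unpaired
    surrogate is returned unchanged.
    """
    unit = units[n]
    if HIGH_FIRST <= unit <= HIGH_LAST and n + 1 < len(units):
        nxt = units[n + 1]
        if LOW_FIRST <= nxt <= LOW_LAST:
            # Combine the pair into one supplementary code point.
            return 0x10000 + ((unit - HIGH_FIRST) << 10) + (nxt - LOW_FIRST)
    if LOW_FIRST <= unit <= LOW_LAST and n > 0:
        if HIGH_FIRST <= units[n - 1] <= HIGH_LAST:
            # Second half of a pair: skip it.
            return None
    return unit


# U+10400 is stored as the surrogate pair D801 DC00.
units = utf16_units("a\U00010400b")
```

With that input, code_point_at(units, 1) gives 0x10400 and
code_point_at(units, 2) gives None, without scanning anything before
index 1.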

> It's a common Python idiom to read megabytes of text into a single
> (8-bit or 16-bit) string object, so changing O(1) to O(n) is a real
> problem!

Yes, I do it all the time... my primary use of Python is managing
Chinese and Japanese lexicographic data where the files are upwards of
25 MB of UTF-8 encoded Unicode text.

Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"