[I18n-sig] How does Python Unicode treat surrogates?
Tom Emerson
tree@basistech.com
Mon, 25 Jun 2001 13:43:23 -0400
Guido van Rossum writes:
> To extract the n'th Unicode character you would have to loop over all
> the preceding characters checking for surrogates. This makes it O(n).
No. If the n'th character is a valid high-surrogate (U+D800 -- U+DBFF)
then look at the n+1'th character for a valid low-surrogate
(U+DC00 -- U+DFFF). If the n'th character is a valid low-surrogate and
the n-1'th character is a valid high-surrogate, then skip it.
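A minimal sketch of that check, assuming the text is held as a sequence
of 16-bit code units (plain integers here). The names code_point_at,
is_high and is_low are made up for illustration and are not part of any
Python API. Note that n indexes code units, so each lookup inspects at
most one neighbour.

```python
def is_high(unit):
    # High (leading) surrogate range: U+D800 -- U+DBFF
    return 0xD800 <= unit <= 0xDBFF

def is_low(unit):
    # Low (trailing) surrogate range: U+DC00 -- U+DFFF
    return 0xDC00 <= unit <= 0xDFFF

def code_point_at(units, n):
    """Return the code point that starts at code unit n.

    Returns None when n is the trailing half of a valid pair
    (the "skip it" case). Unpaired surrogates are returned as-is.
    """
    unit = units[n]
    if is_high(unit) and n + 1 < len(units) and is_low(units[n + 1]):
        # Combine the pair into a supplementary-plane code point.
        return 0x10000 + ((unit - 0xD800) << 10) + (units[n + 1] - 0xDC00)
    if is_low(unit) and n > 0 and is_high(units[n - 1]):
        return None
    return unit

# 'A', U+1D11E as a surrogate pair, 'B'
units = [0x0041, 0xD834, 0xDD1E, 0x0042]
for i in range(len(units)):
    print(i, code_point_at(units, i))
```

Index 1 yields the combined code point and index 2 is skipped; the other
positions come back unchanged.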
> It's a common Python idiom to read megabytes of text into a single
> (8-bit or 16-bit) string object, so changing O(1) to O(n) is a real
> problem!
Yes, I do it all the time... my primary use of Python is managing
Chinese and Japanese lexicographic data where the files are upwards of
25 MB of UTF-8 encoded Unicode text.
--
Tom Emerson Basis Technology Corp.
Sr. Sinostringologist http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"