[I18n-sig] How does Python Unicode treat surrogates?

Tom Emerson tree@basistech.com
Mon, 25 Jun 2001 15:33:35 -0400

Guido van Rossum writes:
> And that's where your proposal simply doesn't work.  If the storage
> units are all 16 bits, and you want the index to count in characters,
> you can't know where in a megabyte-long string to start looking for
> character 1,000,000: you have to iterate over the storage units from
> the beginning until you have counted 1,000,000 characters.  If there
> were no surrogates, that's 1,000,000 storage units from the beginning;
> if all characters happened to be surrogates, it would be 2,000,000
> storage units.  If there are m surrogates between character 0 and
> character n, character n starts at storage unit offset n+m; the only
> way to determine m is a brute-force O(n) search.

Bing, the light goes on. Of course. "Never mind." :-)
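
For the archive, here is a rough sketch of the scan Guido describes. It
is only an illustration, not how Python stores or indexes strings: the
helper names utf16_units and unit_offset are made up for this example,
and it assumes well-formed UTF-16 held as a list of 16-bit storage
units. Finding where character n starts means walking from the front,
because the count of surrogate pairs before it (m) isn't known until
you have looked.

```python
import struct


def utf16_units(text):
    """Encode text as a list of 16-bit storage units (UTF-16, no BOM)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def unit_offset(units, n):
    """Return the storage-unit offset at which character n starts.

    Hypothetical helper: scans from the beginning, so the cost is O(n).
    """
    offset = 0
    for _ in range(n):
        if offset >= len(units):
            raise IndexError("character index out of range")
        # A high surrogate (0xD800-0xDBFF) opens a two-unit character.
        if 0xD800 <= units[offset] <= 0xDBFF:
            offset += 2
        else:
            offset += 1
    if offset >= len(units):
        raise IndexError("character index out of range")
    return offset


# Five characters, two of them outside the BMP (one surrogate pair each).
units = utf16_units("a\U00010000b\U00010400c")
print(len(units))             # 7 storage units for 5 characters
print(unit_offset(units, 4))  # 6, i.e. n + m with n = 4 and m = 2
```

With no surrogate pairs in the string, the offset equals n and the loop
is wasted work, but there is no way to know that without either scanning
or keeping some side index.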

Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"