[I18n-sig] How does Python Unicode treat surrogates?

Mon, 25 Jun 2001 16:08:52 -0400

> I must admit that I wasn't aware of the "\U00020000" notation. I still
> think it should limit itself to 6 digits, not 8.

Too late -- it's some kind of standard already (maybe borrowed from Java?).

> I understand O(n) and O(1) perfectly well. My point is that you do not
> have to scan the entire string when doing this indexing. You only need
> to look at most one storage unit on either side of the index. We're
> only concerned here with transparently handling surrogates when the
> underlying representation is UTF-16.

And that's where your proposal simply doesn't work.  If the storage
units are all 16 bits, and you want the index to count in characters,
you can't know where in a megabyte-long string to start looking for
character 1,000,000: you have to iterate over the storage units from
the beginning until you have counted 1,000,000 characters.  If there
were no surrogates, that's 1,000,000 storage units from the beginning;
if all characters happened to be surrogate pairs, it would be
2,000,000 storage units.  If there are m surrogate pairs between
character 0 and character n, character n starts at storage unit offset
n+m; the only way to determine m is a brute-force O(n) search.
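
To make the scan concrete, here is a minimal sketch of it.  It is an
illustration only, not how any Python build actually stores strings:
it is written for Python 3, it simulates UTF-16 storage with a plain
list of 16-bit integers, and the helper names utf16_units and
unit_offset are made up for this example.

```python
import struct


def utf16_units(s):
    """Return the 16-bit storage units of s as encoded in UTF-16.

    This list stands in for a hypothetical UTF-16 internal buffer.
    """
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def unit_offset(units, char_index):
    """Return the storage-unit offset where character char_index starts.

    Every earlier character may occupy one or two units, so the only
    way to find the offset is to walk the units from the beginning.
    """
    offset = 0
    for _ in range(char_index):
        if offset >= len(units):
            raise IndexError("character index out of range")
        # A high surrogate (0xD800..0xDBFF) opens a two-unit character.
        if 0xD800 <= units[offset] <= 0xDBFF:
            offset += 2
        else:
            offset += 1
    if offset >= len(units):
        raise IndexError("character index out of range")
    return offset


foo = "\u4e00\U00020000a"
units = utf16_units(foo)         # 0x4E00, 0xD840, 0xDC00, 0x0061
a_offset = unit_offset(units, 2)  # character 2 ('a') is at unit 3
```

Looking one storage unit to either side of an offset only tells you
whether that offset falls inside a surrogate pair; it does not tell
you how many pairs came before it, and that count is what the loop
above has to accumulate.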

> > Note that in your above example, char(foo, 2) would not be u'a' but
> > would be u'\u0000', and char(foo, 3) would be u'a'.
> 
> My example above presumes that indices refer to
> characters, not storage units, and that UTF-16 is being used
> transparently internally. So in my world, evaluating
> 
> foo = u"\u4e00\U00020000a"
> 
> would treat foo[1] as u'\U00020000' and foo[2] as u'a'.
> 
> > So I still think you haven't thought this out as much as you believe.
> 
> As I said, I have no belief that this is thought out. I'm merely
> stating what I believe the observable behavior should be.

So explain once more how the observable behavior could be O(1).

--Guido van Rossum (home page: http://www.python.org/~guido/)