[I18n-sig] How does Python Unicode treat surrogates?

Guido van Rossum guido@digicool.com
Mon, 25 Jun 2001 15:22:58 -0400


> Guido van Rossum writes:
> > Ouch.  So suppose we have a string u containing four items: a regular
> > 16-bit char, a high surrogate, a low surrogate, and another regular
> > 16-bit char.  You're saying that u[0] should return the first
> > character, u[1] the entire surrogate (so it would still be a 2-item
> > string), u[2] I guess the empty string, and u[3] the final regular
> > char.
> [...]
> 
> No, but we may as well stop going around on this, since my views are
> not going to happen.
> 
> In my view the string 'u' is a Unicode string. I don't care what sits
> underneath, whether 16 bits or 32 bits. As far as I'm concerned
> the string has three characters in it:
> 
> foo = u"\u4e00\u020000a"
> 
> means that foo[0] == u"\u4e00", foo[1] == u"\u020000", and foo[2] ==
> u"a".

I hope you meant foo = u"\u4e00\U00020000a" and foo[1] == u'\U00020000'.

(I worry that your sloppy use of variable length \u escapes above
shows that your understanding of the subject matter is less than
you've made me believe.  Please say it ain't so!)
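
For reference, a small sketch of the difference between the two escape
forms. It assumes a present-day Python 3 interpreter, where strings are
indexed by code point; the names sloppy and intended are illustrative
only. \u consumes exactly four hex digits and \U exactly eight, so the
spelling quoted above does not contain U+20000 at all.

```python
# \u takes exactly four hex digits; \U takes exactly eight.
sloppy = "\u4e00\u020000a"      # U+4E00, U+0200, then the literal text "00a"
intended = "\u4e00\U00020000a"  # U+4E00, U+20000, then "a"

# Show the code points each literal actually contains.
print([hex(ord(ch)) for ch in sloppy])
print([hex(ord(ch)) for ch in intended])
print(len(sloppy), len(intended))
```

On a 16-bit build of the kind discussed in this thread, the intended
string would occupy four storage units rather than three.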

> The fact that this is represented internally in different ways
> shouldn't matter to the user who only cares about characters.

You misunderstand.  I am claiming that this shouldn't happen because
it would make u[i] an O(n) operation.  Then you brought up an argument
that suggested a way of indexing that *wouldn't* make it O(n), and
that's what I guessed (in my "Ouch" paragraph quoted above).

But what you describe now doesn't have a constant number of storage
units per character, so it has to have O(n) indexing time (unless you
assume a terribly hairy data structure).
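
To make the cost concrete, here is a minimal sketch of character
indexing over 16-bit storage. It assumes Python 3 and simulates the
16-bit representation as a list of UTF-16 code units; utf16_units,
width_at and character are hypothetical helper names, not an existing
API. Because a character may take one or two units, finding the i-th
character means walking over every character before it.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

def is_high(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low(unit):
    return 0xDC00 <= unit <= 0xDFFF

def width_at(units, pos):
    """Number of code units used by the character starting at pos."""
    if is_high(units[pos]) and pos + 1 < len(units) and is_low(units[pos + 1]):
        return 2
    return 1

def character(units, i):
    """Return the code units of the i-th character, scanning from the start."""
    pos = 0
    for _ in range(i):  # one step per preceding character, so O(i)
        if pos >= len(units):
            raise IndexError("character index out of range")
        pos += width_at(units, pos)
    if pos >= len(units):
        raise IndexError("character index out of range")
    return units[pos:pos + width_at(units, pos)]

foo = utf16_units("\u4e00\U00020000a")
print(len(foo))           # number of storage units
print(character(foo, 1))  # both halves of the surrogate pair
print(character(foo, 2))  # the final "a"
```

Avoiding the scan would need an auxiliary index from character
positions to storage offsets, which is the kind of extra data structure
the paragraph above alludes to.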

I'm worried that you don't understand the O(n) notation, or that you
don't understand why what you are proposing would make indexing O(n).
Your suggestion of "O(1+c) for some small c" makes me *really* worried
about this.

In which case what you want ain't gonna happen, but not for the reason
you fear (BDFL decree): it's not well thought out.

> > IMO that would break an important invariant of string-like objects,
> > namely that len(s[i]) == 1.
> 
> Yes it would, which is why it isn't what I'm recommending.
> 
> > I could live with a method u.character(i) that would behave like the
> > above rule -- but not the u[i] notation.
> 
> Me too. 'nuff said. ;-)

But would u.character(i) be O(1) or O(n)?

> > But wouldn't it be enough to have a test u.issurrogate() that would
> > test if the first character of u is a valid high-surrogate?  (And
> > maybe another test u.islowsurrogate() testing for a valid
> > low-surrogate.)  Then you could write it yourself easily:
> 
> > def char(u, i):
> >     c = u[i]
> >     if c.issurrogate():
> >         c2 = u[i+1]
> >         assert c2.islowsurrogate()
> >         c = c + c2
> >     return c
> 
> Sure, as long as you check for the edge conditions. This should be in
> the library.

Note that in your above example, char(foo, 2) would not be u'a' but
would be u'\udc00' (the lone low surrogate), and char(foo, 3) would
be u'a'.
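
The index shift can be checked with a sketch of the quoted helper. It
assumes Python 3 and models the 16-bit string as a list of UTF-16 code
units; the free functions is_high_surrogate and is_low_surrogate stand
in for the proposed (and hypothetical) issurrogate() and
islowsurrogate() methods, and the assert is replaced by an explicit
edge-condition check.

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def char(units, i):
    """Return the unit at index i, plus its low surrogate if it starts a pair."""
    c = [units[i]]
    if is_high_surrogate(units[i]):
        # Edge conditions: the pair must be complete and well formed.
        if i + 1 >= len(units) or not is_low_surrogate(units[i + 1]):
            raise ValueError("unpaired high surrogate at index %d" % i)
        c.append(units[i + 1])
    return c

# U+4E00, U+20000 as the surrogate pair D840 DC00, then "a".
foo = [0x4E00, 0xD840, 0xDC00, ord("a")]

for i in range(len(foo)):
    print(i, [hex(unit) for unit in char(foo, i)])
```

The helper still takes storage-unit indices, so index 2 lands in the
middle of the pair and "a" is only reachable at index 3.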

So I still think you haven't thought this out as much as you believe.

--Guido van Rossum (home page: http://www.python.org/~guido/)