[I18n-sig] How does Python Unicode treat surrogates?

Tom Emerson tree@basistech.com
Mon, 25 Jun 2001 14:12:17 -0400

Guido van Rossum writes:
> Ouch.  So suppose we have a string u containing four items: a regular
> 16-bit char, a high surrogate, a low surrogate, and another regular
> 16-bit char.  You're saying that u[0] should return the first
> character, u[1] the entire surrogate (so it would still be a 2-item
> string), u[2] I guess the empty string, and u[3] the final regular
> char.

No, but we may as well stop going around on this, since my views are
not going to happen.

In my view the string 'u' is a Unicode string. I don't care what sits
underneath, whether it is 16-bit or 32-bit units. As far as I'm
concerned the following string has three characters in it:

foo = u"\u4e00\U00020000a"

means that foo[0] == u"\u4e00", foo[1] == u"\U00020000", and foo[2] ==
u"a".
The fact that this is represented internally in different ways
shouldn't matter to the user who only cares about characters.
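
As a small illustration of the character-level view, here is a sketch
in present-day Python 3 (3.3 or later), where str is indexed by code
point. That is an assumption about the interpreter, not the behaviour
of the 16-bit builds discussed in this thread. The UTF-16 encoding is
used only to expose the 16-bit code units underneath.

```python
# Assumes Python 3.3+, where str is indexed by code point.
foo = "\u4e00\U00020000a"

# Three characters, regardless of the storage underneath.
assert len(foo) == 3
assert foo[1] == "\U00020000"
assert foo[2] == "a"

# The same text encoded as UTF-16 needs four 16-bit code units (8 bytes),
# because U+20000 becomes the surrogate pair D840 DC00.
utf16 = foo.encode("utf-16-le")
code_units = [
    int.from_bytes(utf16[i:i + 2], "little")
    for i in range(0, len(utf16), 2)
]
```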

> IMO that would break an important invariant of string-like objects,
> namely that len(s[i]) == 1.

Yes it would, which is why it isn't what I'm recommending.

> I could live with a method u.character(i) that would behave like the
> above rule -- but not the u[i] notation.

Me too. 'nuff said. ;-)

> But wouldn't it be enough to have a test u.issurrogate() that would
> test if the first character of u is a valid high-surrogate?  (And
> maybe another test u.islowsurrogate() testing for a valid
> low-surrogate.)  Then you could write it yourself easily:

> def char(u, i):
>     c = u[i]
>     if c.issurrogate():
>         c2 = u[i+1]
>         assert c2.islowsurrogate()
>         c = c + c2
>     return c

Sure, as long as you check for the edge conditions. This should be in
the library.
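
One way the edge conditions might be handled is sketched below. It is
only an illustration: is_high_surrogate, is_low_surrogate and char are
hypothetical helper names, not library functions. The string is
assumed to hold one item per 16-bit code unit, which present-day
Python 3 can imitate by writing surrogate escapes in a literal.

```python
def is_high_surrogate(c):
    """True if c is a single item in the range U+D800..U+DBFF."""
    return len(c) == 1 and "\ud800" <= c <= "\udbff"


def is_low_surrogate(c):
    """True if c is a single item in the range U+DC00..U+DFFF."""
    return len(c) == 1 and "\udc00" <= c <= "\udfff"


def char(u, i):
    """Return u[i], plus the following low surrogate if u[i] starts a pair.

    Hypothetical helper. An unpaired surrogate is returned by itself
    rather than raising, which is one possible policy among several.
    """
    if i < 0:
        i += len(u)  # accept negative indices the way u[i] does
    if not 0 <= i < len(u):
        raise IndexError("string index out of range")
    c = u[i]
    # Edge conditions: the high surrogate may be the last item, or the
    # item after it may not be a low surrogate.
    if is_high_surrogate(c) and i + 1 < len(u) and is_low_surrogate(u[i + 1]):
        return c + u[i + 1]
    return c


# Four items: a regular char, a high surrogate, a low surrogate,
# and another regular char.
u = "\u4e00\ud840\udc00a"
```

With these helpers, char(u, 1) returns the two-item pair, while
char(u, 2) returns the low surrogate by itself. A library version
would have to decide whether that second case should raise instead.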

Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"