[I18n-sig] How does Python Unicode treat surrogates?

Guido van Rossum guido@digicool.com
Mon, 25 Jun 2001 14:42:24 -0400


> No. If the n'th character is a valid high-surrogate (U+D800 -- U+DBFF)
> then look at the n+1'th character for a valid low-surrogate. If the
> n'th character is a valid low-surrogate and the n-1'th character is a
> valid high-surrogate, then skip it.

Ouch.  So suppose we have a string u containing four items: a regular
16-bit char, a high surrogate, a low surrogate, and another regular
16-bit char.  You're saying that u[0] should return the first
character, u[1] the entire surrogate pair (so it would still be a 2-item
string), u[2] I guess the empty string, and u[3] the final regular
char.

IMO that would break an important invariant of string-like objects,
namely that len(s[i]) == 1.
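
To make the breakage concrete, here is a rough sketch of the quoted rule
as a plain function.  The name proposed_getitem is made up for
illustration, and the sketch assumes a Python whose strings can hold
lone surrogates as separate items (as a str in Python 3 can), which
stands in for a string of 16-bit items:

```python
def proposed_getitem(u, i):
    """Hypothetical u[i] under the quoted rule (illustration only)."""
    c = u[i]
    # High surrogate followed by a low surrogate: return the whole pair.
    if ("\ud800" <= c <= "\udbff" and i + 1 < len(u)
            and "\udc00" <= u[i + 1] <= "\udfff"):
        return c + u[i + 1]
    # Low surrogate preceded by a high surrogate: "skip it".
    if ("\udc00" <= c <= "\udfff" and i > 0
            and "\ud800" <= u[i - 1] <= "\udbff"):
        return ""
    return c

# Four items: regular char, high surrogate, low surrogate, regular char.
u = "a\ud835\udd38b"
plain_lengths = [len(u[i]) for i in range(len(u))]
proposed_lengths = [len(proposed_getitem(u, i)) for i in range(len(u))]
print(plain_lengths)     # ordinary indexing keeps len(s[i]) == 1
print(proposed_lengths)  # the quoted rule yields lengths 1, 2, 0, 1
```

With ordinary indexing every item has length 1; under the quoted rule
the lengths vary with what is stored at each position.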

I could live with a method u.character(i) that would behave like the
above rule -- but not the u[i] notation.

But wouldn't it be enough to have a test u.issurrogate() that would
test if the first character of u is a valid high-surrogate?  (And
maybe another test u.islowsurrogate() testing for a valid
low-surrogate.)  Then you could write it yourself easily:

def char(u, i):
    c = u[i]
    if c.issurrogate():
        c2 = u[i+1]
        assert c2.islowsurrogate()
        c = c + c2
    return c

(Don't pay attention to the method names I'm proposing -- that's for a
separate subcommittee. :-)
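
Since neither test exists as a string method, a runnable sketch of the
same idea has to use stand-in functions.  The names is_high_surrogate
and is_low_surrogate below are assumptions made for this example, not
an existing API.  Unlike the version above, this one returns an
unpaired high surrogate unchanged rather than asserting:

```python
def is_high_surrogate(c):
    """True if the first item of c is in U+D800..U+DBFF."""
    return 0xD800 <= ord(c[0]) <= 0xDBFF

def is_low_surrogate(c):
    """True if the first item of c is in U+DC00..U+DFFF."""
    return 0xDC00 <= ord(c[0]) <= 0xDFFF

def char(u, i):
    """Return u[i], extended to the full pair if u[i] starts one."""
    c = u[i]
    if is_high_surrogate(c) and i + 1 < len(u):
        c2 = u[i + 1]
        if is_low_surrogate(c2):
            c = c + c2
    return c

# Regular char, high surrogate, low surrogate, regular char.
u = "a\ud835\udd38b"
pair = char(u, 1)
print(len(u), len(pair))
```

Note that u[i] itself keeps its usual meaning here; only the helper
knows about pairs.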

--Guido van Rossum (home page: http://www.python.org/~guido/)