[I18n-sig] How does Python Unicode treat surrogates?
Tom Emerson
tree@basistech.com
Mon, 25 Jun 2001 14:12:17 -0400
Guido van Rossum writes:
> Ouch. So suppose we have a string u containing four items: a regular
> 16-bit char, a high surrogate, a low surrogate, and another regular
> 16-bit char. You're saying that u[0] should return the first
> character, u[1] the entire surrogate (so it would still be a 2-item
> string), u[2] I guess the empty string, and u[3] the final regular
> char.
[...]
No, but we may as well stop going around on this, since my views are
not going to happen.
In my view the string 'u' is a Unicode string. I don't care what sits
underneath, whether 16 bits or 32 bits. As far as I'm concerned
the string has three characters in it:
foo = u"\u4e00\U00020000a"
means that foo[0] == u"\u4e00", foo[1] == u"\U00020000", and foo[2] ==
u"a".
The fact that this is represented internally in different ways shouldn't
matter to the user who only cares about characters.
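To make the distinction concrete, here is a small sketch of the
three-character string described above, written with the eight-digit
\U escape. It assumes a current Python 3 interpreter (3.3 or later),
where strings are indexed by code point; that goes beyond the builds
discussed in this thread. The names code_points and code_units are
only illustrative.

```python
# Assumption: Python 3.3+, where str is indexed by code point.
foo = "\u4e00\U00020000a"

# The character view: three code points.
code_points = [ord(c) for c in foo]

# The 16-bit view: encode to UTF-16 (big-endian, no BOM) and split the
# bytes into 16-bit code units. U+20000 becomes a surrogate pair.
utf16 = foo.encode("utf-16-be")
code_units = [
    int.from_bytes(utf16[i:i + 2], "big")
    for i in range(0, len(utf16), 2)
]

print(len(foo), [hex(cp) for cp in code_points])
print(len(code_units), [hex(cu) for cu in code_units])
```

The same text is three items in one view and four in the other, which
is exactly the gap between what the user sees and what sits underneath.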
> IMO that would break an important invariant of string-like objects,
> namely that len(s[i]) == 1.
Yes it would, which is why it isn't what I'm recommending.
> I could live with a method u.character(i) that would behave like the
> above rule -- but not the u[i] notation.
Me too. 'nuff said. ;-)
> But wouldn't it be enough to have a test u.issurrogate() that would
> test if the first character of u is a valid high-surrogate? (And
> maybe another test u.islowsurrogate() testing for a valid
> low-surrogate.) Then you could write it yourself easily:
>     def char(u, i):
>         c = u[i]
>         if c.issurrogate():
>             c2 = u[i+1]
>             assert c2.islowsurrogate()
>             c = c + c2
>         return c
Sure, as long as you check for the edge conditions. This should be in
the library.
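As a rough sketch of what such a library helper might look like with
the edge conditions handled: the version below works on a string whose
items are 16-bit code units. The helpers is_high_surrogate() and
is_low_surrogate() are hypothetical stand-ins for the proposed
u.issurrogate() and u.islowsurrogate() methods, which do not exist.
Returning an unpaired surrogate unchanged is one possible policy, not
the only reasonable one. The example assumes Python 3, where a str may
hold lone surrogate code points.

```python
def is_high_surrogate(c):
    # Hypothetical helper: true for a single unit in U+D800..U+DBFF.
    return len(c) == 1 and "\ud800" <= c <= "\udbff"


def is_low_surrogate(c):
    # Hypothetical helper: true for a single unit in U+DC00..U+DFFF.
    return len(c) == 1 and "\udc00" <= c <= "\udfff"


def char(u, i):
    """Return the character that starts at code unit index i of u.

    A high surrogate followed by a low surrogate is returned as a
    two-item string. Anything else, including an unpaired surrogate,
    is returned as a one-item string.
    """
    if i < 0:
        # Normalize negative indexes so that i + 1 cannot wrap around
        # to the start of the string.
        i += len(u)
    c = u[i]  # raises IndexError when i is out of range
    if is_high_surrogate(c):
        # Edge condition: the high surrogate may be the last item, or
        # may be followed by something that is not a low surrogate.
        if i + 1 < len(u) and is_low_surrogate(u[i + 1]):
            return c + u[i + 1]
    return c


u = "\u4e00\ud840\udc00a"
pieces = [char(u, i) for i in range(len(u))]
```

Note that an index landing on the low half of a pair (index 2 here)
yields that low surrogate by itself; a caller walking the string
character by character would advance by len(char(u, i)) instead of
by one.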
--
Tom Emerson Basis Technology Corp.
Sr. Sinostringologist http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"