[I18n-sig] How does Python Unicode treat surrogates?
Tim Peters
tim.one@home.com
Mon, 25 Jun 2001 01:37:25 -0400
[M.-A. Lemburg]
> ...
> 2. What to do when slicing of Unicode strings would break
> a surrogate pair ?
To me a string is a sequence of characters, and s[0] returns the first, s[1]
the second, and so on. The internal details of how the implementation
chooses to torture itself <0.7 wink> should be invisible. That is, breaking
a surrogate via slicing should be impossible: s[i:j] returns j-i
characters, and that's that. This implies the internal start address for
the character s[i] can't be computed as base + N*i, unless-- what? --some
fixed number B of bits >= 21 is used internally for each character.
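
A rough sketch of the addressing point, for illustration only. It uses modern Python 3 (an assumption; not the Py_UNICODE implementation under discussion), where str already indexes by character, and it goes through the codecs to get at the UTF-16 and UTF-32 forms. Code points run up to U+10FFFF, which takes 21 bits. The helper names char_at_fixed_width, char_at_utf16 and is_high_surrogate are made up for this example.

```python
import struct

# 'a', U+10400 (a character outside the BMP), 'b'
s = "a\U00010400b"

# UTF-16 view: the non-BMP character occupies two 16-bit code units.
utf16 = s.encode("utf-16-le")
units = struct.unpack("<%dH" % (len(utf16) // 2), utf16)

# UTF-32 view: every character occupies exactly 4 bytes.
utf32 = s.encode("utf-32-le")


def char_at_fixed_width(buf, i, width=4):
    """Fetch character i with plain base + width*i arithmetic (UTF-32)."""
    return buf[width * i : width * (i + 1)].decode("utf-32-le")


def is_high_surrogate(unit):
    """True if a 16-bit code unit is the first half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDBFF


def char_at_utf16(units, i):
    """Fetch character i from UTF-16 code units.

    The offset cannot be computed directly, so scan from the start and
    step over surrogate pairs as single characters.
    """
    pos = 0
    for _ in range(i):
        pos += 2 if is_high_surrogate(units[pos]) else 1
    if is_high_surrogate(units[pos]):
        hi, lo = units[pos], units[pos + 1]
        return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
    return chr(units[pos])


print(len(s), len(units))            # characters vs. UTF-16 code units
print(char_at_utf16(units, 2))       # third character, found by scanning
print(char_at_fixed_width(utf32, 2)) # third character, found by arithmetic
```

With the variable-width form, indexing by raw code unit (units[1]) lands on half of a surrogate pair, which is exactly the breakage the question is about; character indexing either pays for a scan, as char_at_utf16 does, or needs a fixed-width representation.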
> ...
> BTW, Python's Unicode implementation is bound to the standard
> defined at www.unicode.org; moving over to ISO 10646 is not an
> option.
I doubt that either std says anything about how an implementation represents
characters internally. And I'm certain neither mentions Py_UNICODE at all
<wink>.