[I18n-sig] How does Python Unicode treat surrogates?

Tim Peters tim.one@home.com
Mon, 25 Jun 2001 01:37:25 -0400


[M.-A. Lemburg]
> ...
> 2. What to do when slicing of Unicode strings would break
>    a surrogate pair ?

To me a string is a sequence of characters, and s[0] returns the first, s[1]
the second, and so on.  The internal details of how the implementation
chooses to torture itself <0.7 wink> should be invisible.  That is, breaking
a surrogate via slicing should be impossible:  s[i:j] returns j-i
characters, and that's that.  This implies the internal start address for
the character s[i] can't be computed as base + N*i unless (what?) some
fixed number B of bits >= 21 is used internally for each character, since
code points run up to U+10FFFF.
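
A rough sketch of the point, in Python.  The sample string and the helper
names utf16_units and utf32_units are made up for illustration, and this
assumes a Python whose strings are indexed by code point rather than by
16-bit unit.  It only shows why fixed-offset indexing lands on half a
surrogate pair with 16-bit units but not with 32-bit ones; it says nothing
about how the implementation should actually store things.

```python
import struct

def utf16_units(s):
    """Return the 16-bit code units of s (hypothetical helper)."""
    raw = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def utf32_units(s):
    """Return the 32-bit code units of s (hypothetical helper)."""
    raw = s.encode("utf-32-le")
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))

# 'a', then U+10400 (outside the 16-bit range), then 'b'
s = "a\U00010400b"

u16 = utf16_units(s)
u32 = utf32_units(s)

# Three characters, but four 16-bit units.
print(len(s), len(u16), len(u32))      # 3 4 3

# "base + N*i" with 16-bit units: index 1 is only half of a character.
print(hex(u16[1]))                     # 0xd801, a high surrogate
print(hex(u16[2]))                     # 0xdc00, its low surrogate

# With a fixed 32-bit unit, index 1 is the whole character.
print(hex(u32[1]))                     # 0x10400

# Width needed to hold the largest code point in one fixed-size slot.
print((0x10FFFF).bit_length())         # 21
```

With 16-bit units, a slice taken at unit offsets 1:2 would hand back a lone
surrogate; with one fixed-width slot per character, s[i:j] is always j-i
whole characters.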

> ...
> BTW, Python's Unicode implementation is bound to the standard
> defined at www.unicode.org; moving over to ISO 10646 is not an
> option.

I doubt that either standard says anything about how an implementation
represents characters internally.  And I'm certain neither mentions
Py_UNICODE at all <wink>.