[I18n-sig] How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Mon, 25 Jun 2001 13:39:07 +0200

Tim Peters wrote:
> [M.-A. Lemburg]
> > ...
> > 2. What to do when slicing of Unicode strings would break
> >    a surrogate pair ?
> To me a string is a sequence of characters, and s[0] returns the first, s[1]
> the second, and so on.  The internal details of how the implementation
> chooses to torture itself <0.7 wink> should be invisible.  That is, breaking
> a surrogate via slicing should be impossible:  s[i:j] returns j-i
> characters, and that's that. 

It's not that simple: lone surrogates are true Unicode code points in
their own right; it's just that they are pretty useless without
their respective partners in the data stream. And with this "feature"
they are in good company: the Unicode combining characters (e.g. the
combining acute) have the same property.
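
To make the parallel concrete, here is a rough sketch. It assumes
Python 3 syntax, and the helpers utf16_units and is_surrogate are
made-up names for illustration only, not part of any Python API. It
cuts a sequence of UTF-16 code units in the middle of a surrogate pair,
then cuts a string right before a combining acute; in both cases the
slice boundary leaves behind a code point that is valid on its own but
has lost its partner.

```python
import unicodedata


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if the code unit lies in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= unit <= 0xDFFF


# U+10000 lies outside the BMP, so UTF-16 stores it as a surrogate pair.
units = utf16_units("a\U00010000b")
pair_broken = units[:2]          # the slice ends between the two surrogates
lone_surrogate = is_surrogate(pair_broken[-1])

# "e" followed by U+0301 COMBINING ACUTE ACCENT displays as one accented letter.
accented = "e\u0301x"
tail = accented[1:]              # the slice starts at the combining mark
lone_combining = unicodedata.combining(tail[0]) != 0

print([hex(u) for u in units], lone_surrogate, lone_combining)
```

Neither slice is an error as far as the data goes; whether the result
is still meaningful text is a separate question.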

Hard to say what's right and wrong here... (note that I posted the
questions without an initial comment on what I think about these issues
-- I simply don't know for sure just yet ;-)

> This implies the internal start address for
> the character s[i] can't be computed as base + N*i, unless-- what? --some
> fixed number B of bits >= 20 is used internally for each character.
> > ...
> > BTW, Python's Unicode implementation is bound to the standard
> > defined at www.unicode.org; moving over to ISO 10646 is not an
> > option.
> I doubt that either std says anything about how an implementation represents
> characters internally.  And I'm certain neither mentions Py_UNICODE at all
> <wink>.

That comment was aimed at Martin's proposal to stick with ISO 10646
for the UTF-8 codec treatment of lone surrogates. It has nothing
to do with how we store Unicode internally... (sorry for the

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/