[I18n-sig] How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Sun, 24 Jun 2001 20:16:59 +0200

"Martin v. Loewis" wrote:
> > The basic questions are:
> >
> > 1. How to treat lone surrogates (the Unicode char U+10000 is
> >    represented as the two words 0xd800 0xdc00 in UTF-16) ?
> >
> > 2. What to do when slicing of Unicode strings would break
> >    a surrogate pair ?
> >
> > 3. How to treat input data which has lone surrogate words
> >    in strings (at the start, in the middle and at the end) ?
> >
> > 4. How to process requests for creating output data from
> >    lone surrogate words ?
>
> I'd like to add another question:
> 0. Should Py_UNICODE be extended to 32 bits?

This would mean 4 bytes per Unicode character, which is
unacceptable given that most of those bytes would be 0-bytes
in practice. It would also break binary compatibility with the
native Unicode wchar_t type on e.g. Windows, which is among
the most Unicode-aware platforms there are today.
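
To make the trade-off concrete, here is a minimal sketch of the
surrogate arithmetic and of the storage cost of 16-bit versus 32-bit
code units. It assumes a present-day Python 3 interpreter and its
standard "utf-16-le" and "utf-32-le" codecs; the helper names
to_surrogate_pair and storage_bytes are hypothetical, made up for this
illustration, and are not part of Python's Unicode implementation.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into two UTF-16 words.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return high, low


def storage_bytes(text):
    """Return (bytes with 16-bit units, bytes with 32-bit units).

    Hypothetical helper; uses the BOM-less little-endian codecs so that
    only the payload is counted.
    """
    return len(text.encode("utf-16-le")), len(text.encode("utf-32-le"))


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))
print(storage_bytes("abc"))
print(storage_bytes("\U00010000"))
```

For U+10000 the helper yields the two words 0xd800 0xdc00 mentioned
above. Text made up of characters below U+10000 takes twice the space
with 32-bit units, while a character that needs a surrogate pair takes
4 bytes either way.
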
> > BTW, Python's Unicode implementation is bound to the standard
> > defined at www.unicode.org; moving over to ISO 10646 is not an
> > option.
> Can you elaborate? How can you rule out that option that easily?

It is not an option because we chose Unicode as our basis for
i18n work and not the ISO 10646 Universal Character Set. I'd rather
have those two camps fight over the details of the Unicode standard
than try to fix anything related to the differences between the two
in Python by mixing them.

> And why can't Python support the two standards simultaneously?

Why would you want to support two standards for the same thing ?

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/