[I18n-sig] How does Python Unicode treat surrogates?

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Sun, 24 Jun 2001 19:03:33 +0200


> The basic questions are:
> 
> 1. How to treat lone surrogates (the Unicode char U+10000 is
>    represented as the two words 0xd800 0xdc00 in UTF-16) ?
> 
> 2. What to do when slicing of Unicode strings would break
>    a surrogate pair ?
> 
> 3. How to treat input data which has lone surrogate words 
>    in strings (at the start, in the middle and at the end) ?
> 
> 4. How to process requests for creating output data from 
>    lone surrogate words ?
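
To make questions 1 to 3 concrete, here is a minimal sketch of the UTF-16 arithmetic involved. It is not taken from Python's Unicode implementation; the helper names to_surrogates and lone_surrogates are made up for illustration, and strings are modelled as plain lists of 16-bit words.

```python
# Sketch only: UTF-16 surrogate arithmetic on lists of 16-bit words.
# The function names are hypothetical, not part of any Python API.

def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the supplementary range")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def lone_surrogates(units):
    """Return indices of surrogate words that are not part of a valid pair."""
    lone = []
    i = 0
    while i < len(units):
        u = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= u <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            i += 2  # well-formed high/low pair, skip both words
            continue
        if 0xD800 <= u <= 0xDFFF:
            lone.append(i)  # surrogate without a matching partner
        i += 1
    return lone

high, low = to_surrogates(0x10000)
units = [0x41, high, low, 0x42]    # "A", U+10000, "B" as 16-bit words
print(hex(high), hex(low))         # 0xd800 0xdc00
print(lone_surrogates(units))      # []
print(lone_surrogates(units[:2]))  # [1]
```

The last line shows the slicing problem from question 2: cutting the list after the second word leaves the high surrogate 0xd800 without its partner.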

I'd like to add another question:

0. Should Py_UNICODE be extended to 32 bits?
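
As a rough illustration of what is at stake, the sketch below counts the code units needed for the same text at 16 and at 32 bits per unit. It assumes a current Python 3 interpreter and uses the standard utf-16-le and utf-32-le codecs as a stand-in for the two Py_UNICODE widths; the helper code_units is hypothetical.

```python
# Sketch only: compare storage in 16-bit and 32-bit code units.
# The codecs stand in for the two possible Py_UNICODE widths.

def code_units(text, width):
    """Count the code units needed to store `text` at 16 or 32 bits per unit."""
    codec = {16: "utf-16-le", 32: "utf-32-le"}[width]
    return len(text.encode(codec)) // (width // 8)

s = "A\U00010000B"        # "A", U+10000, "B"
print(code_units(s, 16))  # 4
print(code_units(s, 32))  # 3
```

With 32-bit units every code point occupies exactly one unit, so indexing and slicing can never split a character; with 16-bit units U+10000 takes two.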

> BTW, Python's Unicode implementation is bound to the standard
> defined at www.unicode.org; moving over to ISO 10646 is not an
> option.

Can you elaborate? How can you rule out that option so easily?
And why can't Python support the two standards simultaneously?

Regards,
Martin