[I18n-sig] How does Python Unicode treat surrogates?
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Sun, 24 Jun 2001 19:03:33 +0200
> The basic questions are:
>
> 1. How to treat lone surrogates (the Unicode char U+10000 is
> represented as the two words 0xd800 0xdc00 in UTF-16) ?
>
> 2. What to do when slicing of Unicode strings would break
> a surrogate pair ?
>
> 3. How to treat input data which has lone surrogate words
> in strings (at the start, in the middle and at the end) ?
>
> 4. How to process requests for creating output data from
> lone surrogate words ?
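For reference, the arithmetic behind questions 1 and 2 can be sketched as below. This is only an illustration, not anything taken from the Python source: the helper names to_surrogate_pair and is_well_formed are made up for this example, and the code works on plain lists of 16-bit integers rather than on Py_UNICODE buffers.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into
    its UTF-16 high and low surrogate words."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low


def is_well_formed(units):
    """Return True if every surrogate in a list of 16-bit words is
    part of a proper high+low pair (hypothetical helper)."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate is lone.
            return False
        else:
            i += 1
    return True


pair = to_surrogate_pair(0x10000)          # the pair from question 1
units = [0x41, pair[0], pair[1], 0x42]     # "A", U+10000, "B"
sliced = units[:2]                         # slice ends inside the pair
```

Here is_well_formed(units) is true, while is_well_formed(sliced) is false, because the slice keeps the high surrogate and drops its partner. That is the situation question 2 asks about.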
I'd like to add another question:
0. Should Py_UNICODE be extended to 32 bits?
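To make the trade-off concrete, here is a small sketch of how the code unit count for the same text differs between a 16-bit and a 32-bit representation. It assumes a Python 3 interpreter and uses the standard utf-16-le and utf-32-le codecs to stand in for the two possible Py_UNICODE widths; it says nothing about how the interpreter stores strings internally.

```python
# Assumption: Python 3, where "\U00010000" is a single code point.
text = "a\U00010000b"

# With 16-bit units, U+10000 needs a surrogate pair (two units).
utf16_units = len(text.encode("utf-16-le")) // 2

# With 32-bit units, every code point is exactly one unit.
utf32_units = len(text.encode("utf-32-le")) // 4

print(utf16_units, utf32_units)
```

With 16-bit units the text occupies four units, so indexing and slicing can land between the two surrogates; with 32-bit units it occupies three, one per code point, and the slicing question disappears at the cost of more memory per character.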
> BTW, Python's Unicode implementation is bound to the standard
> defined at www.unicode.org; moving over to ISO 10646 is not an
> option.
Can you elaborate? How can you rule that option out so easily?
And why can't Python support the two standards simultaneously?
Regards,
Martin