[Python-Dev] New Py_UNICODE doc

Wed May 11 01:34:30 CEST 2005

On May 10, 2005, at 2:48 PM, Nicholas Bastin wrote:
> On May 9, 2005, at 12:59 AM, Martin v. Löwis wrote:
>
>
>>> Wow, what an inane way of looking at it.  I don't know what world  
>>> you
>>> live in, but in my world, users read the configure options and  
>>> suppose
>>> that they mean something.  In fact, they *have* to go off on  
>>> their own
>>> to assume something, because even the documentation you refer to  
>>> above
>>> doesn't say what happens if they choose UCS-2 or UCS-4.  A logical
>>> assumption would be that python would use those CEFs internally, and
>>> that would be incorrect.
>>>
>>
>> Certainly. That's why the documentation should be improved. Changing
>> the option breaks existing packaging systems, and should not be done
>> lightly.
>>
>
> I'm perfectly happy to continue supporting --enable-unicode=ucs2,  
> but not displaying it as an option.  Is that acceptable to you?
>

If you're going to call python's implementation UTF-16, I'd consider  
all these very serious deficiencies:
- unicodedata doesn't work for 2-char strings containing a surrogate  
pairs, nor integers. Therefore it is impossible to get any data on  
chars > 0xFFFF.
- there are no methods for determining if something is a surrogate  
pair and turning it into a integer codepoint.
- Given that unicodedata doesn't work, I doubt also that .toupper/etc  
work right on surrogate pairs, although I haven't tested.
- As has been noted before, the regexp engine doesn't properly treat  
surrogate pairs as a single unit.
- Is there a method that is like unichr but that will work for  
codepoints > 0xFFFF?

I'm sure there's more as well. I think it's a mistake to consider  
python to be implementing UTF-16 just because it properly encodes/ 
decodes surrogate pairs in the UTF-8 codec.

James