[Python-Dev] New Py_UNICODE doc
James Y Knight
foom at fuhm.net
Wed May 11 01:34:30 CEST 2005
On May 10, 2005, at 2:48 PM, Nicholas Bastin wrote:
> On May 9, 2005, at 12:59 AM, Martin v. Löwis wrote:
>
>
>>> Wow, what an inane way of looking at it. I don't know what world you
>>> live in, but in my world, users read the configure options and suppose
>>> that they mean something. In fact, they *have* to go off on their own
>>> to assume something, because even the documentation you refer to above
>>> doesn't say what happens if they choose UCS-2 or UCS-4. A logical
>>> assumption would be that python would use those CEFs internally, and
>>> that would be incorrect.
>>>
>>
>> Certainly. That's why the documentation should be improved. Changing
>> the option breaks existing packaging systems, and should not be done
>> lightly.
>>
>
> I'm perfectly happy to continue supporting --enable-unicode=ucs2,
> but not displaying it as an option. Is that acceptable to you?
>
If you're going to call python's implementation UTF-16, I'd consider
all these very serious deficiencies:
- unicodedata doesn't work for 2-char strings containing a surrogate
pair, nor for integers. Therefore it is impossible to get any data on
chars > 0xFFFF.
- there are no methods for determining if something is a surrogate
pair and turning it into an integer codepoint.
- Given that unicodedata doesn't work, I also doubt that .upper()/etc
work right on surrogate pairs, although I haven't tested.
- As has been noted before, the regexp engine doesn't properly treat
surrogate pairs as a single unit.
- Is there a method that is like unichr but that will work for
codepoints > 0xFFFF?
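
For illustration, the two missing operations from the list above (recognizing a
surrogate pair and turning it into an integer codepoint, plus a unichr-like
inverse for codepoints > 0xFFFF) can be sketched as follows. This is only a
sketch: the helper names are hypothetical, none of them exist in Python's
standard library, and the code works on UTF-16 code units passed as plain
integers so that it does not depend on how the interpreter was built.

```python
# Hypothetical helpers, not part of Python; code units are plain integers.

HIGH_START, HIGH_END = 0xD800, 0xDBFF  # high (lead) surrogate range
LOW_START, LOW_END = 0xDC00, 0xDFFF    # low (trail) surrogate range


def is_surrogate_pair(hi, lo):
    """Return True if (hi, lo) is a high surrogate followed by a low one."""
    return HIGH_START <= hi <= HIGH_END and LOW_START <= lo <= LOW_END


def pair_to_codepoint(hi, lo):
    """Combine a surrogate pair into a single integer codepoint."""
    if not is_surrogate_pair(hi, lo):
        raise ValueError("not a surrogate pair")
    return 0x10000 + ((hi - HIGH_START) << 10) + (lo - LOW_START)


def codepoint_to_units(cp):
    """unichr-like: return the UTF-16 code unit(s) for any codepoint."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("codepoint out of range")
    if cp < 0x10000:
        return (cp,)  # fits in one code unit
    offset = cp - 0x10000
    return (HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF))


if __name__ == "__main__":
    hi, lo = codepoint_to_units(0x1D11E)
    print(hex(hi), hex(lo), hex(pair_to_codepoint(hi, lo)))
```

The arithmetic is the standard UTF-16 mapping: subtract 0x10000, put the top 10
bits in the high surrogate and the bottom 10 bits in the low surrogate.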
I'm sure there's more as well. I think it's a mistake to consider
python to be implementing UTF-16 just because it properly
encodes/decodes surrogate pairs in the UTF-8 codec.
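
To make the codec point concrete, here is a small sketch of what "properly
encoding" a codepoint > 0xFFFF means: the UTF-8 form is a single 4-byte
sequence, while the UTF-16 form is a surrogate pair. It assumes a modern
Python 3 interpreter, where chr() accepts the full range; the function name is
made up for this example.

```python
import struct


def encodings(cp):
    """Return (UTF-8 bytes, tuple of UTF-16 code units) for a codepoint.

    Assumes Python 3, where chr() accepts codepoints up to 0x10FFFF.
    """
    text = chr(cp)
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")  # big-endian, no byte order mark
    units = struct.unpack(">%dH" % (len(utf16) // 2), utf16)
    return utf8, units


if __name__ == "__main__":
    utf8, units = encodings(0x10000)
    print(utf8, [hex(u) for u in units])
```

Getting this round trip right is a property of the codec alone; it says nothing
about whether indexing, unicodedata, or the regexp engine treat the pair as one
character.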
James