[Python-Dev] New Py_UNICODE doc
Nicholas Bastin
nbastin at opnet.com
Fri May 6 22:21:53 CEST 2005
On May 6, 2005, at 3:42 PM, James Y Knight wrote:
> On May 6, 2005, at 2:49 PM, Nicholas Bastin wrote:
>> If this is the case, then we're clearly misleading users. If the
>> configure script says UCS-2, then as a user I would assume that
>> surrogate pairs would *not* be encoded, because I chose UCS-2, and it
>> doesn't support that. I would assume that any UTF-16 string I would
>> read would be transcoded into the internal type (UCS-2), and
>> information would be lost. If this is not the case, then what does
>> the configure option mean?
>
> It means all the string operations treat strings as if they were
> UCS-2, but that in actuality, they are UTF-16. Same as the case in the
> Windows APIs and Java. That is, all string operations are essentially
> broken, because they're operating on encoded bytes, not characters,
> but claim to be operating on characters.
Well, this is a completely separate issue/problem. The internal
representation is UTF-16, and should be stated as such. If the
built-in methods actually don't work with surrogate pairs, then that
should be fixed.
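
To make the code unit versus character distinction concrete, here is a
minimal sketch. It assumes a modern Python 3 (3.3 or later), where len()
counts code points, so it does not reproduce the behavior of a 2005
narrow build. The helper names utf16_code_units and is_surrogate are
made up for illustration and are not part of any Python API.

```python
# Illustration only: compare UTF-16 code units with code points.
# Assumes Python 3.3+, where str is indexed by code point.

def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-be")  # big-endian, no BOM
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def is_surrogate(unit):
    """True if a 16-bit code unit lies in the surrogate range."""
    return 0xD800 <= unit <= 0xDFFF

# 'a' followed by U+10400, a character outside the BMP.
s = "a\U00010400"
units = utf16_code_units(s)

print(len(s))                      # number of code points
print(len(units))                  # number of UTF-16 code units
print([hex(u) for u in units])     # the surrogate pair shows up here
```

A string operation that treats each 16-bit unit as a character would see
three "characters" here and could split the pair in the middle, which is
the breakage described above.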
--
Nick