[Python-Dev] New Py_UNICODE doc
Nicholas Bastin
nbastin at opnet.com
Sat May 7 00:53:24 CEST 2005
On May 6, 2005, at 5:21 PM, Shane Hathaway wrote:
> Nicholas Bastin wrote:
>> On May 6, 2005, at 3:42 PM, James Y Knight wrote:
>>> It means all the string operations treat strings as if they were
>>> UCS-2, but that in actuality, they are UTF-16. Same as the case in
>>> the Windows APIs and Java. That is, all string operations are
>>> essentially broken, because they're operating on encoded bytes, not
>>> characters, but claim to be operating on characters.
>>
>>
>> Well, this is a completely separate issue/problem. The internal
>> representation is UTF-16, and should be stated as such. If the
>> built-in methods actually don't work with surrogate pairs, then that
>> should be fixed.
>
> Wait... are you saying a Py_UNICODE array contains either UTF-16 or
> UTF-32 characters, but never UCS-2? That's a big surprise to me. I
> may need to change my PyXPCOM patch to fit this new understanding. I
> tried hard to not care how Python encodes unicode characters, but
> details like this are important when combining two frameworks with
> different unicode APIs.
Yes. Well, inasmuch as a large part of UTF-16 directly overlaps
UCS-2, Unicode strings sometimes contain UCS-2 characters.
However, characters which would not be legal in UCS-2 are still encoded
properly in Python, in UTF-16.
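
To make the overlap concrete, here is a minimal sketch of how a code point
turns into UTF-16 code units. It assumes a current Python 3, where iterating
a str yields whole code points; the helper names utf16_code_units and
codec_code_units are made up for illustration and are not part of any
Python API.

```python
import struct

def utf16_code_units(text):
    """Encode text by hand into a list of 16-bit UTF-16 code units."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP character: a single code unit, identical to UCS-2.
            units.append(cp)
        else:
            # Outside the BMP: split the 20-bit offset into a surrogate pair.
            offset = cp - 0x10000
            units.append(0xD800 | (offset >> 10))    # high surrogate
            units.append(0xDC00 | (offset & 0x3FF))  # low surrogate
    return units

def codec_code_units(text):
    """Same result via the built-in utf-16-le codec, as a cross-check."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

euro = "\u20ac"      # EURO SIGN, inside the BMP
clef = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP

print([hex(u) for u in utf16_code_units(euro)])
print([hex(u) for u in utf16_code_units(clef)])
```

The BMP character comes out as one unit that UCS-2 would also accept; the
other one needs two units, which is exactly the case a UCS-2-only consumer
gets wrong.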
And yes, I feel your pain; that's how I *got* into this position.
Mapping from external Unicode types is an important aspect of writing
extension modules, and the documentation does not help people trying to
do this. The fact that Python's internal encoding is variable is a
huge problem in and of itself, even if it were documented properly.
This is why tools like Xerces and ICU will be happy to give you
whatever form of Unicode strings you want, but internally they always
use UTF-16 - to avoid having to write two internal implementations of
the same functionality. If you look up and down
Objects/unicodeobject.c you'll see a fair amount of code written a
couple of different ways (using #ifdefs) because of the variability in
the internal representation.
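
From the Python side, one way to sidestep the variable representation is to
cross the boundary through a codec rather than the raw buffer. The following
is only a sketch under that assumption: is_narrow_build, from_external_utf16
and to_external_utf16 are hypothetical helper names, and the only real
interfaces used are sys.maxunicode and the utf-16-le / utf-16-be codecs.
sys.maxunicode is 0xFFFF on the old 16-bit builds and 0x10FFFF otherwise.

```python
import sys

def is_narrow_build():
    """True when this interpreter uses 16-bit internal units (old narrow builds)."""
    return sys.maxunicode == 0xFFFF

def _codec(byteorder):
    # Explicit-endian codecs neither expect nor emit a byte order mark.
    if byteorder not in ("little", "big"):
        raise ValueError("byteorder must be 'little' or 'big'")
    return "utf-16-le" if byteorder == "little" else "utf-16-be"

def from_external_utf16(raw, byteorder="little"):
    """Turn a UTF-16 byte buffer from an external library into a Python string."""
    return raw.decode(_codec(byteorder))

def to_external_utf16(text, byteorder="little"):
    """Produce a UTF-16 byte buffer suitable for handing to an external library."""
    return text.encode(_codec(byteorder))

clef_le = b"\x34\xd8\x1e\xdd"  # U+1D11E as a little-endian surrogate pair
text = from_external_utf16(clef_le)
print(is_narrow_build(), to_external_utf16(text) == clef_le)
```

The conversion costs a copy, but the calling code no longer cares which
internal width the interpreter was built with.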
--
Nick