[Python-Dev] New Py_UNICODE doc
Shane Hathaway
shane at hathawaymix.org
Sat May 7 01:05:38 CEST 2005
Nicholas Bastin wrote:
>
> On May 6, 2005, at 5:21 PM, Shane Hathaway wrote:
>> Wait... are you saying a Py_UNICODE array contains either UTF-16 or
>> UTF-32 characters, but never UCS-2? That's a big surprise to me. I may
>> need to change my PyXPCOM patch to fit this new understanding. I tried
>> hard to not care how Python encodes unicode characters, but details like
>> this are important when combining two frameworks with different unicode
>> APIs.
>
>
> Yes. Well, inasmuch as a large part of UTF-16 directly overlaps
> UCS-2, unicode strings sometimes contain only UCS-2 characters.
> However, characters which would not be legal in UCS-2 are still encoded
> properly in Python, in UTF-16.
>
> And yes, I feel your pain, that's how I *got* into this position.
> Mapping from external unicode types is an important aspect of writing
> extension modules, and the documentation does not help people trying to
> do this. The fact that python's internal encoding is variable is a huge
> problem in and of itself, even if that was documented properly. This is
> why tools like Xerces and ICU will be happy to give you whatever form of
> unicode strings you want, but internally they always use UTF-16 - to
> avoid having to write two internal implementations of the same
> functionality. If you look up and down Objects/unicodeobject.c you'll
> see a fair amount of code written a couple of different ways (using
> #ifdef's) because of the variability in the internal representation.
Ok. Thanks for helping me understand where Python stands with respect to
unicode. I can work around the issues (or maybe try to help solve them)
now that I know the current state of affairs. If Python correctly handled
UTF-16 strings internally, we wouldn't need the UCS-4 configuration
switch, would we?
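
To make the UCS-2 vs. UTF-16 distinction concrete, here is a minimal
sketch. It is only an illustration, not code from unicodeobject.c or from
the PyXPCOM patch. It assumes an interpreter where string literals are
unicode and "\U" escapes are accepted, and the helper names
utf16_code_units and is_ucs2_representable are made up for this example.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units UTF-16 uses to encode text."""
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def is_ucs2_representable(text):
    """True if every character fits in one 16-bit unit (U+0000..U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp = "\u20ac"         # EURO SIGN: same single unit in UCS-2 and UTF-16
astral = "\U0001d11e"  # MUSICAL SYMBOL G CLEF: not legal in UCS-2

print([hex(u) for u in utf16_code_units(bmp)])     # one code unit
print([hex(u) for u in utf16_code_units(astral)])  # a surrogate pair
print(is_ucs2_representable(bmp), is_ucs2_representable(astral))
```

The second character takes two 16-bit units in UTF-16, so code that
assumes one array element per character only holds for the part of
UTF-16 that overlaps UCS-2.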
Shane