[Python-Dev] New Py_UNICODE doc

Sat May 7 01:05:38 CEST 2005

Nicholas Bastin wrote:
> 
> On May 6, 2005, at 5:21 PM, Shane Hathaway wrote:
>> Wait... are you saying a Py_UNICODE array contains either UTF-16 or
>> UTF-32 characters, but never UCS-2?  That's a big surprise to me.  I may
>> need to change my PyXPCOM patch to fit this new understanding.  I tried
>> hard not to care how Python encodes Unicode characters, but details like
>> this are important when combining two frameworks with different Unicode
>> APIs.
> 
> 
> Yes.  Well, inasmuch as a large part of UTF-16 directly overlaps
> UCS-2, Unicode strings sometimes contain UCS-2 characters.
> However, characters which would not be legal in UCS-2 are still encoded
> properly in Python, in UTF-16.
> 
> And yes, I feel your pain; that's how I *got* into this position.
> Mapping from external Unicode types is an important aspect of writing
> extension modules, and the documentation does not help people trying to
> do this.  The fact that Python's internal encoding is variable is a huge
> problem in and of itself, even if it were documented properly.  This is
> why tools like Xerces and ICU will be happy to give you whatever form of
> Unicode strings you want, but internally they always use UTF-16, to
> avoid having to write two internal implementations of the same
> functionality.  If you look up and down Objects/unicodeobject.c you'll
> see a fair amount of code written a couple of different ways (using
> #ifdefs) because of the variability in the internal representation.

Ok.  Thanks for helping me understand where Python is WRT Unicode.  I
can work around the issues (or maybe try to help solve them) now that I
know the current state of affairs.  If Python correctly handled UTF-16
strings internally, we wouldn't need the UCS-4 configuration switch,
would we?
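
To make the UCS-2 versus UTF-16 distinction above concrete, here is a
minimal sketch using only the standard codecs.  It assumes a modern
Python 3 interpreter rather than the 2005-era builds discussed in this
thread, and the helper names (utf16_code_units, fits_in_ucs2) are
made up for illustration; they are not part of any Python API.  It
also ignores the subtlety of lone surrogate code points.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def fits_in_ucs2(text):
    """True if every code point lies in the BMP (hypothetical helper)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp_char = "\u20ac"          # EURO SIGN, inside the BMP
astral_char = "\U0001d11e"   # MUSICAL SYMBOL G CLEF, outside the BMP

# A BMP character is one code unit, identical in UCS-2 and UTF-16.
euro_units = utf16_code_units(bmp_char)

# A non-BMP character has no UCS-2 form; UTF-16 uses a surrogate pair.
clef_units = utf16_code_units(astral_char)

print([hex(u) for u in euro_units], fits_in_ucs2(bmp_char))
print([hex(u) for u in clef_units], fits_in_ucs2(astral_char))
```

Code that treats a 16-bit array as UCS-2 would count the second
string as two characters, which is the kind of mismatch that matters
when bridging two frameworks.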

Shane