[Python-Dev] New Py_UNICODE doc

Sat May 7 00:53:24 CEST 2005

On May 6, 2005, at 5:21 PM, Shane Hathaway wrote:

> Nicholas Bastin wrote:
>> On May 6, 2005, at 3:42 PM, James Y Knight wrote:
>>> It means all the string operations treat strings as if they were
>>> UCS-2, but that in actuality, they are UTF-16. Same as the case in the
>>> windows APIs and Java. That is, all string operations are essentially
>>> broken, because they're operating on encoded bytes, not characters,
>>> but claim to be operating on characters.
>>
>>
>> Well, this is a completely separate issue/problem. The internal
>> representation is UTF-16, and should be stated as such.  If the
>> built-in methods actually don't work with surrogate pairs, then that
>> should be fixed.
>
> Wait... are you saying a Py_UNICODE array contains either UTF-16 or
> UTF-32 characters, but never UCS-2?  That's a big surprise to me.  I may
> need to change my PyXPCOM patch to fit this new understanding.  I tried
> hard to not care how Python encodes unicode characters, but details like
> this are important when combining two frameworks with different unicode
> APIs.

Yes.  Well, inasmuch as a large part of UTF-16 directly overlaps
UCS-2, unicode strings will often contain only characters that are
also valid UCS-2.  However, characters which would not be legal in
UCS-2 are still encoded properly in Python, in UTF-16 (as surrogate
pairs).
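
To make the overlap concrete, here is a minimal sketch (not taken from
the Python sources; the helper name utf16_units is made up for
illustration).  It lists the 16-bit code units a string occupies when
encoded as UTF-16.  A character inside the UCS-2 range takes one unit,
while a character outside it, such as U+1D11E, takes two:

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units of `text` when encoded as UTF-16.

    Hypothetical helper for illustration only; not part of any Python API.
    """
    raw = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

# "A" (U+0041) is representable in UCS-2: one code unit.
print([hex(u) for u in utf16_units("A")])

# U+1D11E (MUSICAL SYMBOL G CLEF) is not: UTF-16 uses a surrogate pair.
print([hex(u) for u in utf16_units("\U0001D11E")])
```

Code that assumes one 16-bit unit per character would see the second
string as two "characters", which is the breakage James describes.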

And yes, I feel your pain; that's how I *got* into this position.
Mapping from external unicode types is an important aspect of writing
extension modules, and the documentation does not help people trying to
do this.  The fact that Python's internal encoding is variable is a
huge problem in and of itself, even if it were documented properly.
This is why tools like Xerces and ICU will be happy to give you
whatever form of unicode strings you want, but internally they always
use UTF-16 - to avoid having to write two internal implementations of
the same functionality.  If you look up and down
Objects/unicodeobject.c you'll see a fair amount of code written a
couple of different ways (using #ifdefs) because of the variability in
the internal representation.
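
As a rough model of why that variability bites extension authors,
the sketch below counts how many storage units the same string needs
at each unit width.  It is Python rather than the C in
unicodeobject.c, the function name code_units is hypothetical, and it
only assumes that a 2-byte unit holds UTF-16 and a 4-byte unit holds
UTF-32, as discussed in this thread:

```python
def code_units(text, unit_size):
    """Count the storage units `text` needs for a given unit width.

    Illustrative model only: unit_size 2 is treated as UTF-16 and
    unit_size 4 as UTF-32.  This is not how CPython computes lengths.
    """
    if unit_size == 2:
        return len(text.encode("utf-16-le")) // 2
    if unit_size == 4:
        return len(text.encode("utf-32-le")) // 4
    raise ValueError("unit_size must be 2 or 4")

clef = "\U0001D11E"  # a character outside the UCS-2 range
for size in (2, 4):
    print(size, code_units(clef, size))
```

The same one-character string occupies two units at one width and one
unit at the other, so any code that indexes or measures the buffer has
to know which case it is in.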

--
Nick