[Python-Dev] New Py_UNICODE doc

Shane Hathaway shane at hathawaymix.org
Fri May 6 23:21:56 CEST 2005


Nicholas Bastin wrote:
> On May 6, 2005, at 3:42 PM, James Y Knight wrote:
>>It means all the string operations treat strings as if they were 
>>UCS-2, but that in actuality, they are UTF-16. Same as the case in the 
>>Windows APIs and Java. That is, all string operations are essentially 
>>broken, because they're operating on encoded bytes, not characters, 
>>but claim to be operating on characters.
> 
> 
> Well, this is a completely separate issue/problem. The internal 
> representation is UTF-16, and should be stated as such.  If the 
> built-in methods actually don't work with surrogate pairs, then that 
> should be fixed.

Wait... are you saying a Py_UNICODE array contains either UTF-16 or
UTF-32 data, but never UCS-2?  That's a big surprise to me.  I may
need to change my PyXPCOM patch to fit this new understanding.  I tried
hard not to care how Python encodes Unicode characters, but details like
this are important when combining two frameworks with different Unicode
APIs.
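
To make the UCS-2 versus UTF-16 distinction concrete, here is a minimal
sketch, assuming a modern Python 3 interpreter where str is indexed by
code point (not the narrow build under discussion in this thread). The
helper names utf16_code_units and fits_in_ucs2 are hypothetical and exist
only for illustration. The sketch shows that a character outside the
Basic Multilingual Plane takes two 16-bit units (a surrogate pair) in
UTF-16, which UCS-2 has no way to represent.

```python
def utf16_code_units(text):
    """Return the 16-bit code units UTF-16 uses to encode text."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def fits_in_ucs2(text):
    """True if every code point fits in one 16-bit unit (BMP only)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp_char = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE, inside the BMP
clef = "\U0001D11E"        # MUSICAL SYMBOL G CLEF, outside the BMP

# One code point, but two UTF-16 code units: a high and a low surrogate.
units = utf16_code_units(clef)
print([hex(u) for u in units])                  # ['0xd834', '0xdd1e']
print(fits_in_ucs2(bmp_char), fits_in_ucs2(clef))  # True False
```

Code that treats each 16-bit unit as one character, as a UCS-2 consumer
would, sees the clef as two separate values. That is the mismatch that
matters when handing buffers between frameworks with different Unicode
APIs.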

Shane

