[Python-Dev] New Py_UNICODE doc

Sat May 7 01:40:01 CEST 2005

On May 6, 2005, at 7:05 PM, Shane Hathaway wrote:

> Nicholas Bastin wrote:
>
>> On May 6, 2005, at 5:21 PM, Shane Hathaway wrote:
>>
>>> Wait... are you saying a Py_UNICODE array contains either UTF-16 or
>>> UTF-32 characters, but never UCS-2?  That's a big surprise to me.  I
>>> may need to change my PyXPCOM patch to fit this new understanding.  I
>>> tried hard to not care how Python encodes unicode characters, but
>>> details like this are important when combining two frameworks with
>>> different unicode APIs.
>>
>> Yes.  Well, in as much as a large part of UTF-16 directly overlaps
>> UCS-2, then sometimes unicode strings contain UCS-2 characters.
>> However, characters which would not be legal in UCS-2 are still
>> encoded properly in python, in UTF-16.
>>
>> And yes, I feel your pain, that's how I *got* into this position.
>> Mapping from external unicode types is an important aspect of writing
>> extension modules, and the documentation does not help people trying
>> to do this.  The fact that python's internal encoding is variable is a
>> huge problem in and of itself, even if that was documented properly.
>> This is why tools like Xerces and ICU will be happy to give you
>> whatever form of unicode strings you want, but internally they always
>> use UTF-16 - to avoid having to write two internal implementations of
>> the same functionality.  If you look up and down
>> Objects/unicodeobject.c you'll see a fair amount of code written a
>> couple of different ways (using #ifdef's) because of the variability
>> in the internal representation.
>>
>
> Ok.  Thanks for helping me understand where Python is WRT unicode.  I
> can work around the issues (or maybe try to help solve them) now that
> I know the current state of affairs.  If Python correctly handled
> UTF-16 strings internally, we wouldn't need the UCS-4 configuration
> switch, would we?

Personally I would rather see Python (3000) grow a new way to
represent strings, more along the lines of the way it's typically
done in Objective-C.  I wrote a little bit about how that works here:

http://bob.pythonmac.org/archives/2005/04/04/pyobjc-and-unicode/

Effectively, instead of having One And Only One Way To Store Text,
you would have one and only one base class (say basestring) that has
some "virtual" methods that know how to deal with text.  Then, you
have several concrete implementations that each implement those
methods for their particular backing store (and possibly encoding,
but that might be implicit in the backing store, i.e. if it's an
ASCII, UCS-2 or UCS-4 backing store).  Currently we more or less have
this at the Python level, between str and unicode, but certainly not
at the C API.
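
To make the shape of that idea concrete, here is a rough sketch in
Python (modern Python 3 syntax, purely illustrative).  All of the names
below (BaseText, ASCIIText, UCS4Text, make_text, code_point_at) are
made up for this example; none of them is an existing or proposed
Python API, and a real design would need far more methods than this.

```python
from abc import ABC, abstractmethod


class BaseText(ABC):
    """Hypothetical abstract text type; subclasses pick the backing store."""

    @abstractmethod
    def __len__(self):
        """Return the number of code points."""

    @abstractmethod
    def code_point_at(self, index):
        """Return the code point at a non-negative index as an int."""

    def to_str(self):
        # Written once against the "virtual" methods, so it works for
        # every backing store without per-encoding branches.
        return "".join(chr(self.code_point_at(i)) for i in range(len(self)))


class ASCIIText(BaseText):
    """One byte per character; only accepts ASCII text."""

    def __init__(self, text):
        self._data = text.encode("ascii")  # raises UnicodeEncodeError otherwise

    def __len__(self):
        return len(self._data)

    def code_point_at(self, index):
        return self._data[index]


class UCS4Text(BaseText):
    """Four bytes per character; can hold any code point."""

    def __init__(self, text):
        self._data = text.encode("utf-32-le")

    def __len__(self):
        return len(self._data) // 4

    def code_point_at(self, index):
        start = 4 * index
        return int.from_bytes(self._data[start:start + 4], "little")


def make_text(text):
    """Pick the narrowest backing store that can represent the text."""
    try:
        return ASCIIText(text)
    except UnicodeEncodeError:
        return UCS4Text(text)
```

Callers would only ever talk to the base class interface, so
make_text("abc") and make_text("a\U0001D11E") hand back different
concrete classes that behave the same way from the outside.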

-bob