[Python-Dev] Py_UNICODE madness

Nicholas Bastin nbastin at opnet.com
Wed May 4 00:36:23 CEST 2005


The documentation for Py_UNICODE states the following:

"This type represents a 16-bit unsigned storage type which is used by
Python internally as basis for holding Unicode ordinals. On platforms
where wchar_t is available and also has 16-bits, Py_UNICODE is a
typedef alias for wchar_t to enhance native platform compatibility. On
all other platforms, Py_UNICODE is a typedef alias for unsigned
short."

However, we have found this not to be true on at least certain Red Hat
versions (maybe all, but I'm not willing to say that at this point).
pyconfig.h on these systems reports that PY_UNICODE_TYPE is wchar_t
and Py_UNICODE_SIZE is 4.  Needless to say, this isn't consistent with
the docs, which promise a 16-bit type.  It also creates quite a few
problems when interfacing Python with other libraries that produce
Unicode data as 16-bit code units.

Is this a bug, or is this behaviour intended?

It turns out that at some point in the past this created problems for
tkinter as well, and the fix applied there was to change tkinter's
internal Unicode representation to 4 bytes too, rather than to track
down the real source of the problem.

Is PY_UNICODE_TYPE guaranteed to be 16 bits wide, or does its width
depend on the platform? (If it does, we can give up now on Python
Unicode compatibility with any other libraries.)  At the very least,
if we can't guarantee the internal representation, then the
PyUnicode_FromUnicode API needs to go away and be replaced with
something capable of transcoding various Unicode inputs into the
internal Python representation.
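
To illustrate the kind of transcoding step I mean, here is a minimal
sketch in plain C.  It assumes the other library hands us UTF-16 code
units and the interpreter wants 4-byte code points.  utf16_to_ucs4 is a
hypothetical helper, not part of the Python C API, and passing unpaired
surrogates through unchanged is just one possible policy.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Python C API function): decode UTF-16 code
 * units into 32-bit code points.
 * Returns the number of code points written to `out`, or (size_t)-1 if
 * `out_cap` is too small. Unpaired surrogates are copied through as-is. */
size_t utf16_to_ucs4(const uint16_t *in, size_t in_len,
                     uint32_t *out, size_t out_cap)
{
    size_t n = 0;
    for (size_t i = 0; i < in_len; i++) {
        uint32_t c = in[i];
        /* A high surrogate followed by a low surrogate encodes one
         * code point above U+FFFF. */
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < in_len &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            c = 0x10000 + ((c - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            i++; /* consume the low surrogate as well */
        }
        if (n == out_cap)
            return (size_t)-1; /* output buffer too small */
        out[n++] = c;
    }
    return n;
}
```

With something like this sitting in front of the 4-byte build, a caller
would no longer need to know what PY_UNICODE_TYPE happens to be on a
given platform.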

--
Nick


