[Python-Dev] Py_UNICODE madness
Nicholas Bastin
nbastin at opnet.com
Wed May 4 00:36:23 CEST 2005
The documentation for Py_UNICODE states the following:
"This type represents a 16-bit unsigned storage type which is used by
Python internally as basis for holding Unicode ordinals. On platforms
where wchar_t is available and also has 16-bits, Py_UNICODE is a
typedef alias for wchar_t to enhance native platform compatibility. On
all other platforms, Py_UNICODE is a typedef alias for unsigned
short."
However, we have found this not to be true on at least certain RedHat
versions (maybe all, but I'm not willing to say that at this point).
pyconfig.h on these systems reports that PY_UNICODE_TYPE is wchar_t,
and Py_UNICODE_SIZE is 4. Needless to say, this isn't consistent with
the docs. It also creates quite a few problems when attempting to
interface Python with other libraries which produce Unicode data.
Is this a bug, or is this behaviour intended?
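
For anyone who wants to check their own platform, here is a minimal
sketch. It assumes the ctypes module is available, and the helper name
describe_unicode_widths is made up for illustration. It reports the
width of the C wchar_t type and the largest code point a Python string
can hold. Note that on interpreters from 3.3 onwards sys.maxunicode is
always 0x10FFFF, so only the wchar_t size still varies by platform.

```python
import ctypes
import sys


def describe_unicode_widths():
    """Return (size of wchar_t in bytes, largest code point a str can hold)."""
    # ctypes.c_wchar maps to the platform's C wchar_t type.
    wchar_size = ctypes.sizeof(ctypes.c_wchar)
    return wchar_size, sys.maxunicode


wchar_size, max_code_point = describe_unicode_widths()
print("sizeof(wchar_t) =", wchar_size)
print("sys.maxunicode  =", hex(max_code_point))
```

A wchar_t size of 4 here matches what pyconfig.h reports on the RedHat
systems described above.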
It turns out that at some point in the past, this created problems for
tkinter as well, so someone just changed the internal Unicode
representation in tkinter to be 4 bytes too, rather than tracking
down the real source of the problem.
Is PY_UNICODE_TYPE guaranteed to always be 16 bits, or does it depend
on the platform (in which case we can give up now on Python Unicode
compatibility with any other libraries)? At the very least, if we
can't guarantee the internal representation, then the
PyUnicode_FromUnicode API needs to go away, and be replaced with
something capable of transcoding various Unicode inputs into the
internal Python representation.
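
To illustrate the kind of transcoding meant here, a rough sketch at the
Python level rather than the C API. It assumes the other library hands
over 16-bit UTF-16 code units, and both helper names are hypothetical.
The built-in utf-16-le codec joins surrogate pairs, so the result does
not depend on how wide the interpreter's internal storage is.

```python
import struct


def utf16_units_to_text(units):
    """Decode a sequence of 16-bit UTF-16 code units into a str."""
    # Pack the units as little-endian unsigned shorts, then let the
    # codec combine any surrogate pairs into single code points.
    raw = struct.pack("<%dH" % len(units), *units)
    return raw.decode("utf-16-le")


def text_to_code_points(text):
    """Return one integer per code point, as a 4-byte-wide buffer would."""
    return [ord(ch) for ch in text]


# 'A' followed by U+1D11E, which UTF-16 encodes as the pair D834 DD1E.
units = [0x0041, 0xD834, 0xDD1E]
text = utf16_units_to_text(units)
code_points = text_to_code_points(text)
```

The three 16-bit input units come out as two code points, which is the
mismatch a caller hits when passing 16-bit data straight to an API that
expects 4-byte units.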
--
Nick