[Python-Dev] Py_UNICODE madness
M.-A. Lemburg
mal at egenix.com
Wed May 4 10:39:16 CEST 2005
Nicholas Bastin wrote:
> The documentation for Py_UNICODE states the following:
>
> "This type represents a 16-bit unsigned storage type which is used by
> Python internally as basis for holding Unicode ordinals. On platforms
> where wchar_t is available and also has 16-bits, Py_UNICODE is a
> typedef alias for wchar_t to enhance native platform compatibility. On
> all other platforms, Py_UNICODE is a typedef alias for unsigned
> short."
>
> However, we have found this not to be true on at least certain RedHat
> versions (maybe all, but I'm not willing to say that at this point).
> pyconfig.h on these systems reports that PY_UNICODE_TYPE is wchar_t,
> and PY_UNICODE_SIZE is 4. Needless to say, this isn't consistent with
> the docs. It also creates quite a few problems when attempting to
> interface Python with other libraries which produce unicode data.
>
> Is this a bug, or is this behaviour intended?
It's a documentation bug. The above was true in Python 2.0 and
still is for standard Python builds. The optional 32-bit support
was added later on (in Python 2.2, see PEP 261) and is only used
if Python is compiled with --enable-unicode=ucs4.
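Unfortunately, RedHat and others have made the UCS4 build their
default, which has caused, and is still causing, lots of problems
with Python extensions shipped as binaries, e.g. RPMs or other
packages: an extension compiled against one Py_UNICODE width does
not work with an interpreter built for the other.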
> It turns out that at some point in the past, this created problems for
> tkinter as well, so someone just changed the internal unicode
> representation in tkinter to be 4 bytes as well, rather than tracking
> down the real source of the problem.
>
> Is PY_UNICODE_TYPE always going to be guaranteed to be 16 bits, or is
> it dependent on your platform? (in which case we can give up now on
> Python unicode compatibility with any other libraries).
It depends on how Python was compiled: 16 bits in a standard
build, 32 bits in a UCS4 build.
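As a rough illustration, the build flavour can be checked at runtime
from Python itself via sys.maxunicode. The helper name
unicode_build_width() is made up for this sketch and is not part of
any Python API.

```python
import sys

def unicode_build_width():
    """Guess the Py_UNICODE width in bytes from sys.maxunicode.

    Hypothetical helper for illustration only.
    """
    # 0xFFFF is reported by a standard (UCS2) build; 0x10FFFF by a
    # build configured with --enable-unicode=ucs4. Much newer
    # interpreters always report 0x10FFFF, so there this only says
    # that the full code point range is available.
    return 2 if sys.maxunicode == 0xFFFF else 4

print("code unit width (bytes):", unicode_build_width())
```

An extension can run a check like this at import time and refuse to
load if the width differs from the one it was compiled against.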
> At the very
> least, if we can't guarantee the internal representation, then the
> PyUnicode_FromUnicode API needs to go away, and be replaced with
> something capable of transcoding various unicode inputs into the
> internal python representation.
We have PyUnicode_Decode() for that. PyUnicode_FromUnicode is
useful and meant for working directly on Py_UNICODE buffers.
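To sketch the decoding route at the Python level: if another library
hands you UTF-16 data, going through a codec gives the same result
whatever the interpreter's internal width is. The function name
from_utf16_buffer() and the sample data are assumptions made up for
this example.

```python
def from_utf16_buffer(raw, byteorder="little"):
    """Decode raw UTF-16 bytes from an external library into a
    Python Unicode string. Hypothetical helper for illustration."""
    # Name the byte order explicitly so no BOM is expected.
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    return raw.decode(codec)

# Sample input: five BMP characters plus one character outside the
# BMP (U+1D11E), which UTF-16 stores as a surrogate pair.
raw = u"caf\u00e9 \U0001d11e".encode("utf-16-le")
text = from_utf16_buffer(raw)
```

The caller never touches a Py_UNICODE buffer here, so the same code
works on both UCS2 and UCS4 builds.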
--
Marc-Andre Lemburg
eGenix.com