[Python-Dev] Py_UNICODE madness

M.-A. Lemburg mal at egenix.com
Wed May 4 10:39:16 CEST 2005


Nicholas Bastin wrote:
> The documentation for Py_UNICODE states the following:
> 
> "This type represents a 16-bit unsigned storage type which is used by  
> Python internally as basis for holding Unicode ordinals. On platforms 
> where wchar_t is available and also has 16-bits,  Py_UNICODE is a 
> typedef alias for wchar_t to enhance  native platform compatibility. On 
> all other platforms,  Py_UNICODE is a typedef alias for unsigned 
> short."
> 
> However, we have found this not to be true on at least certain RedHat 
> versions (maybe all, but I'm not willing to say that at this point).  
> pyconfig.h on these systems reports that PY_UNICODE_TYPE is wchar_t, 
> and PY_UNICODE_SIZE is 4.  Needless to say, this isn't consistent with 
> the docs.  It also creates quite a few problems when attempting to 
> interface Python with other libraries which produce unicode data.
> 
> Is this a bug, or is this behaviour intended?

It's a documentation bug. The above was true in Python 2.0 and
still holds for standard Python builds. The optional 32-bit support
was added later on (in Python 2.1 IIRC) and is only used if Python
is configured with --enable-unicode=ucs4.

Unfortunately, RedHat and others have made the UCS4 build their
default, which has caused, and is still causing, lots of problems
with Python extensions shipped as binaries, e.g. RPMs or
other packages.

> It turns out that at some point in the past, this created problems for 
> tkinter as well, so someone just changed the internal unicode 
> representation in tkinter to be 4 bytes as well, rather than tracking 
> down the real source of the problem.
> 
> Is PY_UNICODE_TYPE always going to be guaranteed to be 16 bits, or is 
> it dependent on your platform? (in which case we can give up now on 
> Python unicode compatibility with any other libraries).  

It depends on how Python was compiled: 16 bits on a standard build,
32 bits if it was configured with --enable-unicode=ucs4.
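
For anyone who wants to check which kind of build they are running,
here is a minimal sketch from the Python side. It assumes an interpreter
of this era, where sys.maxunicode is 0xFFFF on a standard (UCS2) build
and 0x10FFFF on a UCS4 build; the helper name code_unit_bytes is made up
for illustration. From Python 3.3 onwards sys.maxunicode is always
0x10FFFF, so the check only distinguishes builds on older interpreters.

```python
import sys


def code_unit_bytes(maxunicode=sys.maxunicode):
    """Return the Py_UNICODE width in bytes implied by sys.maxunicode.

    Hypothetical helper: 0xFFFF is taken to mean a 16-bit (UCS2) build,
    anything else a 32-bit (UCS4) build.
    """
    return 2 if maxunicode == 0xFFFF else 4


if __name__ == "__main__":
    print("Py_UNICODE width in bytes:", code_unit_bytes())
```

At the C level the same information is what PY_UNICODE_SIZE in
pyconfig.h reports, as mentioned in the quoted text above.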

> At the very 
> least, if we can't guarantee the internal representation, then the 
> PyUnicode_FromUnicode API needs to go away, and be replaced with 
> something capable of transcoding various unicode inputs into the 
> internal python representation.

We have PyUnicode_Decode() for that. PyUnicode_FromUnicode() is
useful and meant for working directly on Py_UNICODE buffers.
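
To illustrate the decode route, here is a rough Python-level sketch.
It uses the codec machinery via bytes.decode(), which is the
Python-side counterpart of calling PyUnicode_Decode() with an explicit
encoding name. The sample code units are invented for the example and
stand in for whatever a 16-bit library would hand over.

```python
import struct

# Hypothetical input: UTF-16 code units from some other library,
# "Hi" followed by U+1D11E encoded as a surrogate pair.
units = [0x0048, 0x0069, 0xD834, 0xDD1E]

# Pack the units as little-endian 16-bit values, i.e. the raw buffer.
raw = struct.pack("<%dH" % len(units), *units)

# Decode by naming the source encoding; the caller never has to know
# how wide the interpreter's internal representation is.
text = raw.decode("utf-16-le")

# Encoding back with the same codec reproduces the original buffer.
roundtrip = text.encode("utf-16-le")
```

The point of going through a named encoding is that the same code
works whether the build uses 16-bit or 32-bit Py_UNICODE.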

-- 
Marc-Andre Lemburg
eGenix.com


