[capi-sig] Unicode compatibility
daniel at stutzbachenterprises.com
Fri May 21 16:34:25 CEST 2010
I'm working on http://bugs.python.org/issue8654 and I'd like to get some
feedback from extension-writers, since it will impact them.
Synopsis of the problem:
If you try to load an extension module that:
- uses any of Python's Unicode functions, and
- was compiled by a Python with the opposite Unicode setting (UCS2 vs UCS4)
then you get an ugly "undefined symbol" error from the linker.
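As a rough illustration of where the "undefined symbol" comes from: CPython's headers rename the Unicode API functions with a width-specific prefix at compile time, so an extension ends up referencing a symbol that only exists in interpreters of the same width. The sketch below imitates that with hypothetical names (DEMO_UNICODE_SIZE, demo_from_string, demo_linked_symbol are invented for this example and are not part of Python's API):

```c
/* Hypothetical stand-in for the interpreter's build-time Unicode width. */
#define DEMO_UNICODE_SIZE 4

/* The public name is silently mapped to a width-specific symbol. */
#if DEMO_UNICODE_SIZE == 2
#  define demo_from_string demoUCS2_from_string
#else
#  define demo_from_string demoUCS4_from_string
#endif

/* Two-level stringification so the macro is expanded before quoting. */
#define DEMO_STR_(x) #x
#define DEMO_STR(x) DEMO_STR_(x)

/* Returns the symbol name an extension built with this setting would
 * ask the linker to resolve. */
static const char *demo_linked_symbol(void)
{
    return DEMO_STR(demo_from_string);
}
```

An extension built with the setting above asks for the UCS4-flavoured symbol. An interpreter built with the other width exports only the UCS2-flavoured one, so the lookup fails at load time.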
For Python 3, __repr__ must return a Unicode object, which means that almost
all extensions will need to call some Unicode functions. It's basically
fruitless to upload a binary egg for Python 3 to PyPI, since it will
generate link errors for a large fraction of downloaders (as I discovered
the hard way).
By default, extensions will compile in a "Unicode-agnostic" mode, where
Py_UNICODE is an incomplete type. The extension's code can pass Py_UNICODE
pointers back and forth between Python API functions, but it can neither
dereference them nor use sizeof(Py_UNICODE). Unicode-agnostic modules will
load and run in both UCS2 and UCS4 interpreters. Most extensions fall into
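A minimal sketch of the incomplete-type idea, assuming nothing about the actual patch: demo_unicode stands in for Py_UNICODE, and demo_get_buffer / demo_char_at are hypothetical "interpreter side" functions invented for this example. The "extension side" can hold and pass the pointer, but any attempt to index it or take its size would be rejected by the compiler.

```c
#include <stddef.h>
#include <stdint.h>

/* Declared but never defined: an incomplete type, like Py_UNICODE in the
 * proposed agnostic mode. */
typedef struct demo_unicode demo_unicode;

/* ---- "interpreter side": knows the real width (4 bytes here) ---- */
static const uint32_t demo_storage[] = { 'h', 'i', 0x1F600 };

static const demo_unicode *demo_get_buffer(size_t *len)
{
    *len = sizeof demo_storage / sizeof demo_storage[0];
    return (const demo_unicode *)demo_storage;
}

static uint32_t demo_char_at(const demo_unicode *buf, size_t i)
{
    return ((const uint32_t *)buf)[i];
}

/* ---- "extension side": only ever sees the opaque pointer ---- */
static size_t ext_count_non_ascii(void)
{
    size_t len, i, n = 0;
    const demo_unicode *buf = demo_get_buffer(&len);

    /* buf[0] or sizeof(*buf) here would be a compile-time error. */
    for (i = 0; i < len; i++) {
        if (demo_char_at(buf, i) > 0x7F)
            n++;
    }
    return n;
}
```

Because the extension code never depends on the element size, the same object code would work whichever width the interpreter side uses.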
If a module needs to dereference Py_UNICODE, it can define
PY_REAL_PY_UNICODE before including Python.h to make Py_UNICODE a complete
type. Attempting to load such a module into a mismatched interpreter will
cause an ImportError (instead of an ugly linker error). If an extension
uses PY_REAL_PY_UNICODE in any .c file, it must also use it in the .c file
that calls PyModule_Create to ensure the Unicode width is stored in the
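One way to picture the load-time check, as a sketch only: the module records the width it was compiled for (or a marker meaning "agnostic"), and the loader compares that with the interpreter's width before accepting it. The struct, field, and function names below are hypothetical and do not reflect how the patch actually stores or checks the value.

```c
/* Hypothetical marker for a module built in Unicode-agnostic mode. */
#define DEMO_WIDTH_AGNOSTIC 0

/* Hypothetical module record; unicode_width is 0 (agnostic), 2 or 4. */
typedef struct {
    const char *name;
    int unicode_width;
} demo_module_def;

/* Returns 1 if the module may be loaded into an interpreter using
 * interp_width bytes per code unit. Returns 0 otherwise, which is where
 * a real loader would raise ImportError. */
static int demo_can_load(const demo_module_def *def, int interp_width)
{
    if (def->unicode_width == DEMO_WIDTH_AGNOSTIC)
        return 1;
    return def->unicode_width == interp_width;
}
```

With a check of this kind, the mismatch is reported as an ordinary import failure with a readable message instead of surfacing from the dynamic linker.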
I have two questions for the greater community:
1) Do you have any fundamental concerns with this design?
2) Would you prefer the default be reversed? i.e., that Py_UNICODE be a
complete type by default, and an extension must have a #define to compile in
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>