[capi-sig] Unicode compatibility
robertwb at math.washington.edu
Mon May 24 18:18:27 CEST 2010
On May 23, 2010, at 10:51 AM, Stefan Behnel wrote:
> Daniel Stutzbach, 21.05.2010 16:34:
>> If you try to load an extension module that:
>> - uses any of Python's Unicode functions, and
>> - was compiled by a Python with the opposite Unicode setting (UCS2
>> vs UCS4)
>> then you get an ugly "undefined symbol" error from the linker.
> Well known problem, yes.
>> By default, extensions will compile in a "Unicode-agnostic" mode, in
>> which Py_UNICODE is an incomplete type. The extension's code can pass
>> pointers back and forth between Python API functions, but it cannot
>> dereference them nor use sizeof(Py_UNICODE). Unicode-agnostic modules
>> will load and run in both UCS2 and UCS4 interpreters. Most extensions
>> fall into this category.
> This is a pretty bad default for Cython code. Starting with version
> 0.13, Cython will try to infer Py_UNICODE for single character
> unicode strings and use that whenever possible, e.g. when for-
> looping over unicode strings and during character comparisons.
> Making Py_UNICODE an incomplete type will render this impossible.
>> If a module needs to dereference Py_UNICODE, it can define
>> PY_REAL_PY_UNICODE before including Python.h to make Py_UNICODE a
>> complete type.
> So that would be an option that all Cython modules (or at least
> those that use Py_UNICODE and/or single unicode characters
> somewhere) would use automatically. Not much to win here.
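(Concretely, the opt-in Daniel describes would presumably amount to one line ahead of the usual include -- PY_REAL_PY_UNICODE is the macro name from the proposal, not an existing CPython define:)

```c
/* Opt in to the real (complete) Py_UNICODE type, per the proposal. */
#define PY_REAL_PY_UNICODE
#include <Python.h>
```

Cython would just have to emit that define unconditionally, which is Stefan's point: it buys nothing for generated code.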
>> Attempting to load such a module into a mismatched interpreter will
>> cause an ImportError (instead of an ugly linker error). If an
>> extension uses PY_REAL_PY_UNICODE in any .c file, it must also use it
>> in the .c file that calls PyModule_Create, to ensure the Unicode
>> width is stored in the module's information.
> Cython modules should normally be self-contained, but it will not be
> 100% sure that a module that wraps C code using Py_UNICODE will also
> use Py_UNICODE somewhere, so that Cython could enable that option
> automatically. Cython would therefore be forced to enable the option
> for basically all code that calls into C code.
>> 2) Would you prefer the default be reversed? I.e., that Py_UNICODE
>> be a complete type by default, and an extension must have a #define
>> to compile in Unicode-agnostic mode?
> Absolutely. IMHO, the only platform that always requires binaries
> due to incomplete operating system support for source distributions
> is MS Windows, where Py_UNICODE equals wchar_t anyway. In some
> cases, MacOS-X is broken enough to require binary releases, too, but
> the normal target on that platform is the system Python, which has a
> universal setting for the Py_UNICODE size as well.
> So the only remaining platforms that suffer from binary
> incompatibility problems here are Linux and Unix systems, where the
> Py_UNICODE size differs between installations and distributions.
> Given that these systems are best targeted with a source
> distribution, it sounds like a bad default to complicate the usage
> of Py_UNICODE for everyone, unless users explicitly disable this
> behaviour. It's much better to provide this as an option for
> extension writers who really want (or need) to provide portable
> binary distributions for whatever reason.
> Personally, I think the drawbacks totally outweigh the single
> advantage, though, so I could absolutely live without this change.
> It's easy enough to drop the linkage error message into a web search.
I would (unsurprisingly) be against this change as well, given the
reasons listed above, but would like to suggest some alternatives.
First, is there a way to easily get the runtime size of Py_UNICODE?
Then the module could be sure to raise an error itself when there's a
mismatch, before doing anything dangerous. A potentially better
alternative would be to record the UCS2/UCS4 distinction as part of
the binary specification/name, with support for choosing the right one
added to the package management infrastructure. Of course this would
double the number of binaries, but that's just a reflection of the
choice to make UCS2/UCS4 a binary-incompatible compile-time decision.
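(On the first alternative: at the Python level the build's width is already observable via sys.maxunicode, which reports 0xFFFF on narrow/UCS2 builds and 0x10FFFF on wide/UCS4 builds, so a guard could look something like the sketch below. check_unicode_width and its flag are hypothetical names for illustration; note also that narrow builds were later removed entirely in Python 3.3, so this only matters for the interpreters under discussion here.)

```python
import sys

# True on a narrow (UCS2) build, False on a wide (UCS4) build.
INTERPRETER_IS_NARROW = sys.maxunicode == 0xFFFF

def check_unicode_width(extension_built_narrow):
    """Raise ImportError early if the extension's compile-time Unicode
    width does not match the running interpreter's, instead of letting
    the mismatch surface as a linker error or memory corruption."""
    if extension_built_narrow != INTERPRETER_IS_NARROW:
        raise ImportError(
            "extension compiled for a %s build, but interpreter is %s" % (
                "UCS2" if extension_built_narrow else "UCS4",
                "UCS2" if INTERPRETER_IS_NARROW else "UCS4"))

# A matching module passes silently; a mismatched one fails loudly.
check_unicode_width(INTERPRETER_IS_NARROW)
```

A C extension would do the equivalent in its init function, comparing its compile-time Py_UNICODE_SIZE against the interpreter before touching any Unicode data.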