[capi-sig] Unicode compatibility

Robert Bradshaw robertwb at math.washington.edu
Mon May 24 18:18:27 CEST 2010


On May 23, 2010, at 10:51 AM, Stefan Behnel wrote:

> Daniel Stutzbach, 21.05.2010 16:34:
>> If you try to load an extension module that:
>> - uses any of Python's Unicode functions, and
>> - was compiled by a Python with the opposite Unicode setting (UCS2 vs UCS4)
>> then you get an ugly "undefined symbol" error from the linker.
>
> Well known problem, yes.
>
>
>> By default, extensions will compile in a "Unicode-agnostic" mode, where
>> Py_UNICODE is an incomplete type. The extension's code can pass Py_UNICODE
>> pointers back and forth between Python API functions, but it cannot
>> dereference them nor use sizeof(Py_UNICODE).  Unicode-agnostic modules will
>> load and run in both UCS2 and UCS4 interpreters.  Most extensions fall into
>> this category.
>
> This is a pretty bad default for Cython code. Starting with version
> 0.13, Cython will try to infer Py_UNICODE for single character
> unicode strings and use that whenever possible, e.g. when
> for-looping over unicode strings and during character comparisons.
> Making Py_UNICODE an incomplete type will render this impossible.
>
>
>> If a module needs to dereference Py_UNICODE, it can define
>> PY_REAL_PY_UNICODE before including Python.h to make Py_UNICODE a complete
>> type
>
> So that would be an option that all Cython modules (or at least  
> those that use Py_UNICODE and/or single unicode characters  
> somewhere) would use automatically. Not much to win here.
>
>
>> Attempting to load such a module into a mismatched interpreter will
>> cause an ImportError (instead of an ugly linker error).  If an extension
>> uses PY_REAL_PY_UNICODE in any .c file, it must also use it in the .c file
>> that calls PyModule_Create to ensure the Unicode width is stored in the
>> module's information.
>
> Cython modules should normally be self-contained, but it will not be  
> 100% sure that a module that wraps C code using Py_UNICODE will also  
> use Py_UNICODE somewhere, so that Cython could enable that option  
> automatically. Cython would therefore be forced to enable the option  
> for basically all code that calls into C code.
>
>
>> 2) Would you prefer the default be reversed?  i.e., that Py_UNICODE be a
>> complete type by default, and an extension must have a #define to compile in
>> Unicode-agnostic mode?
>
> Absolutely. IMHO, the only platform that always requires binaries  
> due to incomplete operating system support for source distributions  
> is MS Windows, where Py_UNICODE equals wchar_t anyway. In some  
> cases, MacOS-X is broken enough to require binary releases, too, but  
> the normal target on that platform is the system Python, which has a  
> universal setting for the Py_UNICODE size as well.
>
> So the only remaining platforms that suffer from binary  
> incompatibility problems here are Linux and Unix systems, where the
> Py_UNICODE size differs between installations and distributions.  
> Given that these systems are best targeted with a source  
> distribution, it sounds like a bad default to complicate the usage  
> of Py_UNICODE for everyone, unless users explicitly disable this  
> behaviour. It's much better to provide this as an option for  
> extension writers who really want (or need) to provide portable  
> binary distributions for whatever reason.
>
> Personally, I think the drawbacks totally outweigh the single  
> advantage, though, so I could absolutely live without this change.  
> It's easy enough to drop the linkage error message into a web search  
> engine.

I would (unsurprisingly) be against this change as well, given the
reasons listed above, but would like to suggest some alternatives.
First, is there a way to easily get the runtime size of Py_UNICODE?
Then the module could be sure to raise an error itself when there's a
mismatch before doing anything dangerous. A potentially better
alternative would be to record the UCS2/UCS4 distinction as part of the
binary specification/name, with support for choosing the right one
added into the package management infrastructures. Of course this will
double the number of binaries, but that's just a reflection of the
choice to make UCS2/UCS4 a binary-incompatible compile-time decision.
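
To make the two alternatives concrete, here is a rough sketch at the
Python level. It assumes the runtime width can be derived from
sys.maxunicode (0xFFFF on a UCS2 build, 0x10FFFF on a UCS4 build). The
helper names and the "-ucs2"/"-ucs4" file name suffix are made up for
illustration; they are not an existing convention or API.

```python
import sys

# Largest code point reported by each build flavour of CPython.
UCS2_MAX = 0xFFFF
UCS4_MAX = 0x10FFFF


def unicode_width(maxunicode=None):
    """Return the Py_UNICODE width in bytes implied by sys.maxunicode."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if maxunicode == UCS2_MAX:
        return 2
    if maxunicode == UCS4_MAX:
        return 4
    raise ValueError("unexpected sys.maxunicode: %r" % (maxunicode,))


def check_unicode_width(built_width, maxunicode=None):
    """Raise ImportError if the extension was built for another width.

    built_width is the width (2 or 4) recorded when the extension was
    compiled; how it gets recorded is left open here.
    """
    running = unicode_width(maxunicode)
    if running != built_width:
        raise ImportError(
            "extension built for UCS%d, interpreter uses UCS%d"
            % (built_width, running)
        )


def binary_name(module, maxunicode=None):
    """Hypothetical file name that records the width, e.g. 'spam-ucs4.so'."""
    return "%s-ucs%d.so" % (module, unicode_width(maxunicode))
```

A module would call check_unicode_width() first thing at import time,
and a package tool could use something like binary_name() to pick the
matching build out of the two that were shipped.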

- Robert



