
I'm working on http://bugs.python.org/issue8654 and I'd like to get some feedback from extension-writers, since it will impact them.
Synopsis of the problem:
If you try to load an extension module that:
- uses any of Python's Unicode functions, and
- was compiled by a Python with the opposite Unicode setting (UCS2 vs UCS4)
then you get an ugly "undefined symbol" error from the linker.
For Python 3, __repr__ must return a Unicode object, which means that almost all extensions will need to call some Unicode functions. It's basically fruitless to upload a binary egg for Python 3 to PyPI, since it will generate link errors for a large fraction of downloaders (as I discovered the hard way).
Proposed solution:
By default, extensions will compile in a "Unicode-agnostic" mode, where Py_UNICODE is an incomplete type. The extension's code can pass Py_UNICODE pointers back and forth between Python API functions, but it cannot dereference them nor use sizeof(Py_UNICODE). Unicode-agnostic modules will load and run in both UCS2 and UCS4 interpreters. Most extensions fall into this category.
If a module needs to dereference Py_UNICODE, it can define PY_REAL_PY_UNICODE before including Python.h to make Py_UNICODE a complete type. Attempting to load such a module into a mismatched interpreter will cause an ImportError (instead of an ugly linker error). If an extension uses PY_REAL_PY_UNICODE in any .c file, it must also use it in the .c file that calls PyModule_Create to ensure the Unicode width is stored in the module's information.
I have two questions for the greater community:
Do you have any fundamental concerns with this design?
Would you prefer the default be reversed? I.e., that Py_UNICODE be a complete type by default, and an extension must have a #define to compile in Unicode-agnostic mode?
Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

Daniel Stutzbach, 21.05.2010 16:34:
Well known problem, yes.
This is a pretty bad default for Cython code. Starting with version 0.13, Cython will try to infer Py_UNICODE for single character unicode strings and use that whenever possible, e.g. when for-looping over unicode strings and during character comparisons. Making Py_UNICODE an incomplete type will render this impossible.
So that would be an option that all Cython modules (or at least those that use Py_UNICODE and/or single unicode characters somewhere) would use automatically. Not much to win here.
Cython modules should normally be self-contained, but it will not be 100% sure that a module that wraps C code using Py_UNICODE will also use Py_UNICODE somewhere, so that Cython could enable that option automatically. Cython would therefore be forced to enable the option for basically all code that calls into C code.
Absolutely. IMHO, the only platform that always requires binaries due to incomplete operating system support for source distributions is MS Windows, where Py_UNICODE equals wchar_t anyway. In some cases, MacOS-X is broken enough to require binary releases, too, but the normal target on that platform is the system Python, which has a universal setting for the Py_UNICODE size as well.
So the only remaining platforms that suffer from binary incompatibility problems here are Linux and Unix systems, where the Py_UNICODE size differs between installations and distributions. Given that these systems are best targeted with a source distribution, it sounds like a bad default to complicate the usage of Py_UNICODE for everyone, unless users explicitly disable this behaviour. It's much better to provide this as an option for extension writers who really want (or need) to provide portable binary distributions for whatever reason.
Personally, I think the drawbacks totally outweigh the single advantage, though, so I could absolutely live without this change. It's easy enough to drop the linkage error message into a web search engine.
Stefan

On May 23, 2010, at 10:51 AM, Stefan Behnel wrote:
I would (unsurprisingly) be against this change as well, given the reasons
listed above, but would like to suggest some alternatives. First, is
there a way to easily get the runtime size of Py_UNICODE? Then the
module could be sure to raise an error itself when there's a mismatch
before doing anything dangerous. A potentially better alternative
would be to record the UCS2/UCS4 distinction as part of the
binary specification/name, with support for choosing the right one
added into the package management infrastructures. Of course this will
double the number of binaries, but that's just a reflection of the
choice to make UCS2/UCS4 a binary incompatible compile time decision.
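
A minimal Python-level sketch of the runtime check asked about above. It assumes sys.maxunicode reflects the build: 0xFFFF on a narrow (UCS2) build, 0x10FFFF on a wide (UCS4) build; interpreters from Python 3.3 onward always report 0x10FFFF. The helper names unicode_width and check_unicode_width are hypothetical, and a C extension would need an equivalent check at the C level.

```python
import sys


def unicode_width(maxunicode=sys.maxunicode):
    """Return 2 for a narrow (UCS2) build and 4 for a wide (UCS4) build."""
    return 2 if maxunicode <= 0xFFFF else 4


def check_unicode_width(built_width, runtime_width=None):
    """Raise ImportError if the width recorded at build time does not
    match the width of the running interpreter.

    built_width is assumed to have been recorded by the build step;
    how it gets recorded is outside this sketch.
    """
    if runtime_width is None:
        runtime_width = unicode_width()
    if built_width != runtime_width:
        raise ImportError(
            "extension built for UCS%d, but this interpreter uses UCS%d"
            % (built_width, runtime_width)
        )
```

Calling check_unicode_width at the top of a package's __init__ would turn a mismatch into an ImportError before the binary module is touched.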
- Robert

Robert, Stefan, thank you for your feedback.
How about the following variation, which I believe will address your concerns:
By default, Py_UNICODE will be a fully-specified type. In a nutshell, the default will behave just like Python 2 or 3.1, except that trying to load a mismatched module will raise an ImportError with a more helpful error message (much friendlier to novice programmers). Cython would continue to use this mode.
Extension authors who want a Unicode-agnostic build can specify an option in their setup.py that will instruct distutils to pass a -D_Py_UNICODE_AGNOSTIC compiler flag to ensure that all of their .c files are built in Unicode-independent mode. That way, the whole extension is compiled in the same mode.
It would indeed be great if package managers included the Unicode setting as part of the platform type. PJE's proposed implementation of that feature (http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care". My patch greatly increases the number of eggs that could label themselves "Don't Care", reducing maintenance work for package maintainers who like to distribute binary eggs [1]. In other words, they are complementary solutions.
[1] A quick Google search of PyPI reveals many packages offering Linux binary eggs.
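
The three-way labelling described above (UCS2, UCS4, or "Don't Care") could be matched against an interpreter roughly as follows. This is only an illustrative sketch, not PJE's implementation: the function name egg_matches, the tag strings "ucs2"/"ucs4", and the use of None for "Don't Care" are all assumptions.

```python
def egg_matches(egg_unicode, interpreter_unicode):
    """Decide whether a binary egg can be loaded by an interpreter.

    egg_unicode: "ucs2", "ucs4", or None for "Don't Care"
                 (a Unicode-agnostic build).
    interpreter_unicode: "ucs2" or "ucs4".
    """
    if interpreter_unicode not in ("ucs2", "ucs4"):
        raise ValueError("unknown interpreter setting: %r" % (interpreter_unicode,))
    if egg_unicode not in ("ucs2", "ucs4", None):
        raise ValueError("unknown egg setting: %r" % (egg_unicode,))
    # A "Don't Care" egg loads anywhere; otherwise the settings must agree.
    return egg_unicode is None or egg_unicode == interpreter_unicode
```

Under this scheme a Unicode-agnostic egg needs to be built only once per platform, while a width-specific egg needs one build per setting.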
Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

Daniel Stutzbach wrote:
That would be our (eGenix) preferred implementation variant as well.
Building Unicode agnostic extensions should be a feature that the extension writers turn on explicitly, rather than being the default that has to be turned off.
However, rather than using a distutils option to enable the agnostic mode, I would presume that extension writers simply write:
    #define _Py_UNICODE_AGNOSTIC 1
    #include "Python.h"
in their code and then add
    [build_ext]
    unicode-agnostic=1
to their setup.cfg.
Rather than waiting for package managers to include support for this (I've been trying to get some awareness for this problem for years, without much success), it's probably better to just fix distutils to include a UCS2/UCS4 marker in the platform string.
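
A rough sketch of what a UCS2/UCS4 marker in the platform string could look like. It assumes the marker is simply appended as a "-ucs2" or "-ucs4" suffix, which is a made-up format for illustration; the function name platform_with_unicode_marker is hypothetical too. sysconfig.get_platform() and sys.maxunicode are the only real interpreter facilities used.

```python
import sys
import sysconfig


def platform_with_unicode_marker(platform=None, maxunicode=sys.maxunicode):
    """Append a Unicode-width marker to a platform string.

    The "-ucs2"/"-ucs4" suffix is an assumed format, not something
    distutils actually emits.
    """
    if platform is None:
        platform = sysconfig.get_platform()
    marker = "ucs2" if maxunicode <= 0xFFFF else "ucs4"
    return "%s-%s" % (platform, marker)
```

With such a marker, binaries built for the two settings would get distinct file names and could sit side by side on an index.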
-- Marc-Andre Lemburg eGenix.com
Professional Python Services directly from the Source (#1, May 26 2010)

On Wed, May 26, 2010 at 5:45 AM, M.-A. Lemburg <mal@egenix.com> wrote:
I think I was much too vague when I said "distutils option". I fear that I implied a command-line option, which is not at all what I intended. I was picturing that the module author would include something like the following in their setup.py:
Extension("foo", ["foo.c"], unicode_agnostic=True)
which would arrange to add _Py_UNICODE_AGNOSTIC to their define_macros. The module author would not (and should not) define the macro themselves at the top of a .c file. By enabling it in setup.py, we guarantee that it will be defined when compiling all of the module's .c files or not at all.
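
The effect of the proposed unicode_agnostic keyword can be approximated with the existing define_macros argument. The helper below is a hypothetical sketch (extension_kwargs is not a distutils function, and _Py_UNICODE_AGNOSTIC is only the macro name proposed in this thread); it merely builds the keyword arguments that would be handed to Extension.

```python
def extension_kwargs(sources, unicode_agnostic=False):
    """Build keyword arguments for an Extension(...) call.

    When unicode_agnostic is true, the proposed macro is added to
    define_macros so that every source file of the extension is
    compiled with it.
    """
    define_macros = []
    if unicode_agnostic:
        # A (name, None) pair defines the macro without a value,
        # i.e. -D_Py_UNICODE_AGNOSTIC on the compiler command line.
        define_macros.append(("_Py_UNICODE_AGNOSTIC", None))
    return {"sources": list(sources), "define_macros": define_macros}


# Intended use in a setup.py (Extension import not shown here):
#   Extension("foo", **extension_kwargs(["foo.c"], unicode_agnostic=True))
```

Because the macro is set once per extension rather than per source file, a build cannot end up with some .c files in agnostic mode and others not.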
In principle, I agree. I don't personally have enough familiarity with the innards of distutils to feel comfortable writing a patch that alters the platform string.
Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

participants (4): Daniel Stutzbach, M.-A. Lemburg, Robert Bradshaw, Stefan Behnel