
Scope
-----

This idea affects the CPython ABI for extension modules. It has no impact on the Python language syntax nor other Python implementations.

The Problem
-----------

Currently, Python can be built with an internal Unicode representation of UCS2 or UCS4. The two are binary incompatible, but the distinction is not included as part of the platform name. Consequently, if one installs a binary egg (e.g., with easy_install), there's a good chance one will get an error such as the following when trying to use it:

    undefined symbol: PyUnicodeUCS2_FromString

In Python 2, some extension modules can blissfully link to either ABI, as the problem only arises for modules that call a PyUnicode_* macro (which expands to calling either a PyUnicodeUCS2_* or PyUnicodeUCS4_* function). For Python 3, every extension type will need to call a PyUnicode_* macro, since __repr__ must return a Unicode object.

This problem has been known since at least 2006, as seen in this thread from the distutils-sig:

http://markmail.org/message/bla5vrwlv3kn3n7e?q=thread:bla5vrwlv3kn3n7e

In that thread, it was suggested that the Unicode representation become part of the platform name. That change would require a distutils and/or setuptools change, which has not happened and does not appear likely to happen in the near future. It would also mean that anyone who wants to provide binary eggs for common platforms will need to provide twice as many eggs.

Solution
--------

Get rid of the ABI difference for the 99% of extension modules that don't care about the internal representation of Unicode strings. From the extension module's point of view, PyObject is opaque. It will manipulate the Unicode string entirely through PyUnicode_* function calls and does not care about the internal representation.

For example, PyUnicode_FromString has the following signature in the documentation:

    PyObject *PyUnicode_FromString(const char *u)

Currently, it's #ifdef'ed to either PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString. Remove the macro and name the function PyUnicode_FromString regardless of which internal representation is being used. The vast majority of binary eggs will then work correctly on both UCS2 and UCS4 Pythons.

Functions that explicitly use Py_UNICODE or PyUnicodeObject as part of their signature will continue to be #ifdef'ed, so extension modules that *do* care about the internal representation will still generate a link error.

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

On Mon, Nov 2, 2009 at 8:53 AM, Daniel Stutzbach <daniel@stutzbachenterprises.com> wrote:
IIUC your proposal doesn't get rid of the root of the problem (that there are two incompatible choices for Unicode string representation) but only proposes that there be a purely "abstract" API for working with string objects, which, if used religiously by extension modules, would allow them to be linked with either family of runtimes.

This sounds attractive, but I kind of doubt that changing a single API is sufficient. Perhaps it would be useful to do a kind of review or survey of how many Unicode APIs are used by the typical extension?

--
--Guido van Rossum (python.org/~guido)

On Mon, Nov 2, 2009 at 11:34 AM, Guido van Rossum <guido@python.org> wrote:
I made an editing error. I meant to suggest altering all the PyUnicode_* macro/functions, except those that explicitly use Py_UNICODE or PyUnicodeObject in their signature. PyUnicode_FromString was just an example.

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

On Mon, Nov 2, 2009 at 9:45 AM, Daniel Stutzbach <daniel@stutzbachenterprises.com> wrote:
We'd also have to hide the macros that can be used to access the internals of a PyUnicodeObject, in order for that approach to be safe. Basically, an extension would have to include a second header file to use those macros and it would have to somehow indicate to the linker that it is using UCS2 or UCS4 internals as well.

I would want to err on the safe side here -- if it was at all easy to create an extension that *seems* to be ABI-neutral but *actually* relies on knowledge about the UCS2 or UCS4 representation, we'd be creating a worse problem. Users don't like stuff not working, but they *really* don't like stuff crashing with random core dumps -- if it has to be broken, let it break very loudly and explicitly.

The current approach satisfies that requirement -- it probably just errs too far on the "never assume it might work" side.

--
--Guido van Rossum (python.org/~guido)

On Mon, Nov 2, 2009 at 11:57 AM, Guido van Rossum <guido@python.org> wrote:
I don't know of a portable way to indicate that to the linker simply by including a header file. I wish I did.

Here is one idea that will cause a linker error if there's a mismatch and one of the macros is used. It does cause the macro to execute an extra CPU instruction or two, though.

In unicodeobject.h:

    /* Require the macro to reference a global variable that will only be
       present if the Unicode ABI matches correctly. Arrange for the global
       variable to always have the value zero, and add it to the return
       value of the macro. */
    #if Py_UNICODE_SIZE == 4
    extern const int Py_UnicodeZero_UCS4;
    #define Py_UNICODE_ZERO (Py_UnicodeZero_UCS4)
    #else
    extern const int Py_UnicodeZero_UCS2;
    #define Py_UNICODE_ZERO (Py_UnicodeZero_UCS2)
    #endif

    #define PyUnicode_AS_UNICODE(op) \
        (Py_UNICODE_ZERO + (((PyUnicodeObject *)(op))->str))

In unicodeobject.c:

    extern const int Py_UNICODE_ZERO = 0;
Agreed.

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

Daniel Stutzbach, 02.11.2009 17:53:
Isn't that the main issue here? IMHO, if EasyInstall was fixed to distinguish extensions for UCS2/UCS4 platforms, that would just make the issue go away for most users. Not for extension builders and package maintainers, admittedly, but certainly for most users.

Stefan

On Thu, Nov 5, 2009 at 1:35 AM, Stefan Behnel <stefan_ml@behnel.de> wrote:
If easy_install were fixed in the way suggested by PJE [1], eggs could effectively be labeled as "UCS2", "UCS4", or "Don't Care". Right now, all eggs are essentially labeled "Don't Care", even if they will fail to link. My proposal would greatly expand the number of eggs that can legitimately be labeled "Don't Care".

It's a complementary proposal; fixing easy_install is certainly still important. :-)

[1] http://bit.ly/1bO62

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

Please see also this thread:

http://www.mail-archive.com/python-dev@python.org/msg42272.html

This is a complementary proposal: that the Python devs should encourage the Linux distributors to converge on a common UCS2/4 choice.

If the ABI improvement that you suggest is not adopted, then my proposal will help users. If the ABI improvement that you suggest is adopted, then my proposal will still help users.

Likewise with the proposal to include the UCS2/4 configuration in the platform description on Linux: http://bugs.python.org/setuptools/issue78 . If that proposal is not implemented, then my proposal will help users. If setuptools issue78 is implemented, then my proposal will still help users.

Regards,

Zooko

participants (4)
- Daniel Stutzbach
- Guido van Rossum
- Stefan Behnel
- Zooko Wilcox-O'Hearn