On Mon, Nov 2, 2009 at 8:53 AM, Daniel Stutzbach firstname.lastname@example.org wrote:
This idea affects the CPython ABI for extension modules. It has no impact on the Python language syntax or on other Python implementations.
Currently, Python can be built with an internal Unicode representation of UCS2 or UCS4. The two are binary incompatible, but the distinction is not included as part of the platform name. Consequently, if one installs a binary egg (e.g., with easy_install), there's a good chance one will get an error such as the following when trying to use it:
undefined symbol: PyUnicodeUCS2_FromString
In Python 2, some extension modules can blissfully link to either ABI, as the problem only arises for modules that call a PyUnicode_* macro (which expands to calling either a PyUnicodeUCS2_* or PyUnicodeUCS4_* function). For Python 3, every extension type will need to call a PyUnicode_* macro, since __repr__ must return a Unicode object.
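The renaming described above can be sketched in a few lines of self-contained C. This is only an illustration of the mechanism, not CPython's actual header: the Demo* names, the DEMO_UNICODE_SIZE switch, and the helper functions are all made up for the example.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical build switch standing in for the UCS2/UCS4 choice. */
#ifndef DEMO_UNICODE_SIZE
#define DEMO_UNICODE_SIZE 4
#endif

/* The documented name is only a macro for a build-specific symbol. */
#if DEMO_UNICODE_SIZE == 2
#define DemoUnicode_FromString DemoUnicodeUCS2_FromString
#else
#define DemoUnicode_FromString DemoUnicodeUCS4_FromString
#endif

/* Two-step stringification so the macro is expanded before quoting. */
#define DEMO_STR2(x) #x
#define DEMO_STR(x) DEMO_STR2(x)

/* Name of the symbol this build actually defines and references. */
const char *demo_runtime_symbol(void)
{
    return DEMO_STR(DemoUnicode_FromString);
}

/* Returns 1 if a prebuilt module needing `wanted` would resolve here. */
int demo_can_link(const char *wanted)
{
    assert(wanted != NULL);
    return strcmp(wanted, demo_runtime_symbol()) == 0;
}
```

With the default switch value, demo_can_link("DemoUnicodeUCS2_FromString") returns 0, which mirrors the "undefined symbol" failure quoted earlier: the module was compiled against one name and the runtime only provides the other.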
This problem has been known since at least 2006, as seen in this thread from the distutils-sig:
In that thread, it was suggested that the Unicode representation become part of the platform name. That change would require a distutils and/or setuptools change, which has not happened and does not appear likely to happen in the near future. It would also mean that anyone who wants to provide binary eggs for common platforms will need to provide twice as many eggs.
Get rid of the ABI difference for the 99% of extension modules that don't care about the internal representation of Unicode strings. From the extension module's point of view, PyObject is opaque. It will manipulate the Unicode string entirely through PyUnicode_* function calls and does not care about the internal representation.
For example, PyUnicode_FromString has the following signature in the documentation:
PyObject *PyUnicode_FromString(const char *u)
Currently, it's #ifdef'ed to either PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString.
Remove the macro and name the function PyUnicode_FromString regardless of which internal representation is being used. The vast majority of binary eggs will then work correctly on both UCS2 and UCS4 Pythons.
Functions that explicitly use Py_UNICODE or PyUnicodeObject as part of their signature will continue to be #ifdef'ed, so extension modules that *do* care about the internal representation will still generate a link error.
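A rough sketch of what the proposed split might look like, again with made-up Demo* names rather than the real CPython headers: the function with an opaque signature keeps a single name in every build, while the one that exposes the raw representation is still renamed. The export table and lookup helper are assumptions added purely to make the effect observable.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical build switch standing in for the UCS2/UCS4 choice. */
#ifndef DEMO_UNICODE_SIZE
#define DEMO_UNICODE_SIZE 4
#endif

/* Only the representation-dependent function is still renamed. */
#if DEMO_UNICODE_SIZE == 2
#define DemoUnicode_AsUnicode DemoUnicodeUCS2_AsUnicode
#else
#define DemoUnicode_AsUnicode DemoUnicodeUCS4_AsUnicode
#endif

/* Two-step stringification so macros are expanded before quoting. */
#define DEMO_STR2(x) #x
#define DEMO_STR(x) DEMO_STR2(x)

/* Symbols this build would export under the proposal. */
static const char *const demo_exports[] = {
    DEMO_STR(DemoUnicode_FromString), /* no macro: same name in every build */
    DEMO_STR(DemoUnicode_AsUnicode),  /* macro: name depends on the build */
};

/* Returns 1 if `name` is among the exported symbols, 0 otherwise. */
int demo_exports_symbol(const char *name)
{
    size_t i;
    assert(name != NULL);
    for (i = 0; i < sizeof demo_exports / sizeof demo_exports[0]; i++) {
        if (strcmp(demo_exports[i], name) == 0)
            return 1;
    }
    return 0;
}
```

In this sketch a module that only needs DemoUnicode_FromString resolves against either build, whereas one compiled against DemoUnicodeUCS2_AsUnicode still fails to resolve on the UCS4 build, which is the behaviour the proposal wants to preserve.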
IIUC your proposal doesn't get rid of the root of the problem (that there are two incompatible choices for Unicode string representation) but only proposes that there be a purely "abstract" API for working with string objects, which, if used religiously by extension modules, would allow them to be linked with either family of runtimes.
This sounds attractive, but I kind of doubt that changing a single API is sufficient. Perhaps it would be useful to do a kind of review or survey of how many Unicode APIs are used by the typical extension?