[Python-ideas] UCS2 vs UCS4 ABIs

Daniel Stutzbach daniel at stutzbachenterprises.com
Mon Nov 2 17:53:00 CET 2009


Scope
-----

This idea affects the CPython ABI for extension modules.  It does not affect
Python language syntax or other Python implementations.

The Problem
-----------

Currently, Python can be built with an internal Unicode representation of
UCS2 or UCS4.  The two are binary incompatible, but the distinction is not
included as part of the platform name.  Consequently, if one installs a
binary egg (e.g., with easy_install), there's a good chance one will get an
error such as the following when trying to use it:

        undefined symbol: PyUnicodeUCS2_FromString

In Python 2, some extension modules can blissfully link to either ABI, as
the problem only arises for modules that call a PyUnicode_* macro (which
expands to calling either a PyUnicodeUCS2_* or PyUnicodeUCS4_* function).
For Python 3, every extension type will need to call a PyUnicode_* macro,
since __repr__ must return a Unicode object.
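
As a rough illustration of the renaming described above (a self-contained
sketch, not the actual CPython header), the following mimics how one source
name turns into two different linker symbols.  DEMO_UNICODE_WIDE is a
hypothetical stand-in for the real build-time switch, and
demo_linked_symbol is a helper invented for this example:

```c
#include <stdio.h>

/* Assumption: DEMO_UNICODE_WIDE plays the role of the build-time switch
 * that selects a UCS4 build.  It is a made-up name for this sketch. */
#ifdef DEMO_UNICODE_WIDE
#  define PyUnicode_FromString PyUnicodeUCS4_FromString
#else
#  define PyUnicode_FromString PyUnicodeUCS2_FromString
#endif

/* Two-level stringification so the argument is macro-expanded first. */
#define DEMO_STR_(x) #x
#define DEMO_STR(x) DEMO_STR_(x)

/* Returns the symbol name an extension would actually reference after
 * preprocessing a call written as PyUnicode_FromString(...). */
static const char *demo_linked_symbol(void)
{
    return DEMO_STR(PyUnicode_FromString);
}

/* Prints the symbol name; hypothetical helper for trying the sketch. */
static void demo_print_symbol(void)
{
    printf("extension links against: %s\n", demo_linked_symbol());
}
```

Compiling the same source with and without -DDEMO_UNICODE_WIDE yields two
different symbol names, which is why an extension built against one kind of
interpreter fails to load in the other.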

This problem has been known since at least 2006, as seen in this thread from
the distutils-sig:

http://markmail.org/message/bla5vrwlv3kn3n7e?q=thread:bla5vrwlv3kn3n7e

In that thread, it was suggested that the Unicode representation become part
of the platform name.  That change would require a distutils and/or
setuptools change, which has not happened and does not appear likely to
happen in the near future.  It would also mean that anyone who wants to
provide binary eggs for common platforms will need to provide twice as many
eggs.

Solution
--------

Get rid of the ABI difference for the 99% of extension modules that don't
care about the internal representation of Unicode strings.  From the
extension module's point of view, PyObject is opaque.  It will manipulate
the Unicode string entirely through PyUnicode_* function calls and does not
care about the internal representation.

For example, PyUnicode_FromString has the following signature in the
documentation:

        PyObject *PyUnicode_FromString(const char *u)

Currently, it's #ifdef'ed to either PyUnicodeUCS2_FromString or
PyUnicodeUCS4_FromString.

Remove the macro and name the function PyUnicode_FromString regardless of
which internal representation is being used.  The vast majority of binary
eggs will then work correctly on both UCS2 and UCS4 Pythons.
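
The idea can be sketched with a toy library, assuming nothing about the real
CPython internals.  All Demo* names below are hypothetical, and the sketch
handles ASCII input only (the real PyUnicode_FromString decodes UTF-8).  The
point is that the exported function name and signature are identical in both
builds, while the storage width stays a private detail:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Callers hold only a pointer to this type.  In a real layout the struct
 * body would live inside the interpreter, not in a public header. */
typedef struct DemoObject DemoObject;

/* Assumption: DEMO_UNICODE_WIDE is a made-up switch for this sketch. */
#ifdef DEMO_UNICODE_WIDE
typedef uint32_t demo_unit;   /* UCS4-style storage */
#else
typedef uint16_t demo_unit;   /* UCS2-style storage */
#endif

struct DemoObject {
    size_t length;
    demo_unit *units;
};

/* Same exported name in both builds: no width-specific renaming. */
DemoObject *Demo_FromString(const char *u)
{
    size_t i, n = strlen(u);
    DemoObject *obj = malloc(sizeof *obj);
    if (obj == NULL)
        return NULL;
    obj->units = malloc((n + 1) * sizeof(demo_unit));
    if (obj->units == NULL) {
        free(obj);
        return NULL;
    }
    for (i = 0; i < n; i++)
        obj->units[i] = (unsigned char)u[i];   /* ASCII only in this sketch */
    obj->units[n] = 0;
    obj->length = n;
    return obj;
}

/* Accessors keep callers independent of sizeof(demo_unit). */
size_t Demo_GetLength(const DemoObject *obj)
{
    return obj->length;
}

void Demo_Free(DemoObject *obj)
{
    if (obj != NULL) {
        free(obj->units);
        free(obj);
    }
}
```

An extension that only calls Demo_FromString, Demo_GetLength, and Demo_Free
references the same three symbols whichever width the library was built with.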

Functions that explicitly use Py_UNICODE or PyUnicodeObject as part of their
signature will continue to be #ifdef'ed, so extension modules that *do* care
about the internal representation will still generate a link error.
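
A minimal sketch of that split, again using a hypothetical
DEMO_UNICODE_WIDE switch rather than the real header: PyUnicode_FromUnicode
takes a Py_UNICODE pointer and so keeps its width-specific name, while
PyUnicode_FromString is left without a #define.  The demo_* helpers are
invented for this example:

```c
#include <stdio.h>

/* Width-sensitive API: keeps the per-build renaming. */
#ifdef DEMO_UNICODE_WIDE
#  define PyUnicode_FromUnicode PyUnicodeUCS4_FromUnicode
#else
#  define PyUnicode_FromUnicode PyUnicodeUCS2_FromUnicode
#endif

/* Width-agnostic API: under the proposal there is deliberately no
 * #define for PyUnicode_FromString, so the name is used as written. */

#define DEMO_STR_(x) #x
#define DEMO_STR(x) DEMO_STR_(x)

/* Symbol referenced by code that never touches Py_UNICODE directly. */
static const char *demo_agnostic_symbol(void)
{
    return DEMO_STR(PyUnicode_FromString);
}

/* Symbol referenced by code that passes Py_UNICODE buffers around. */
static const char *demo_sensitive_symbol(void)
{
    return DEMO_STR(PyUnicode_FromUnicode);
}

/* Prints both names; hypothetical helper for trying the sketch. */
static void demo_print_symbols(void)
{
    printf("agnostic:  %s\n", demo_agnostic_symbol());
    printf("sensitive: %s\n", demo_sensitive_symbol());
}
```

Only the second symbol differs between builds, so the link error is confined
to modules that depend on the width.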

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>