[Python-ideas] UCS2 vs UCS4 ABIs

Mon Nov 2 18:34:51 CET 2009

On Mon, Nov 2, 2009 at 8:53 AM, Daniel Stutzbach
<daniel at stutzbachenterprises.com> wrote:
> Scope
> -----
>
> This idea affects the CPython ABI for extension modules.  It has no impact
> on Python language syntax or on other Python implementations.
>
> The Problem
> -----------
>
> Currently, Python can be built with an internal Unicode representation of
> UCS2 or UCS4.  The two are binary incompatible, but the distinction is not
> included as part of the platform name.  Consequently, if one installs a
> binary egg (e.g., with easy_install), there's a good chance one will get an
> error such as the following when trying to use it:
>
>         undefined symbol: PyUnicodeUCS2_FromString
>
> In Python 2, some extension modules can blissfully link to either ABI, as
> the problem only arises for modules that call a PyUnicode_* macro (which
> expands to calling either a PyUnicodeUCS2_* or PyUnicodeUCS4_* function).
> For Python 3, every extension type will need to call a PyUnicode_* macro,
> since __repr__ must return a Unicode object.
>
> This problem has been known since at least 2006, as seen in this thread from
> the distutils-sig:
>
> http://markmail.org/message/bla5vrwlv3kn3n7e?q=thread:bla5vrwlv3kn3n7e
>
> In that thread, it was suggested that the Unicode representation become part
> of the platform name.  That change would require a distutils and/or
> setuptools change, which has not happened and does not appear likely to
> happen in the near future.  It would also mean that anyone who wants to
> provide binary eggs for common platforms would need to provide twice as
> many eggs.
>
> Solution
> --------
>
> Get rid of the ABI difference for the 99% of extension modules that don't
> care about the internal representation of Unicode strings.  From such an
> extension module's point of view, PyObject is opaque.  The module
> manipulates Unicode strings entirely through PyUnicode_* function calls
> and does not care about the internal representation.
>
> For example, PyUnicode_FromString has the following signature in the
> documentation:
>         PyObject *PyUnicode_FromString(const char *u)
> Currently, it's #ifdef'ed to either PyUnicodeUCS2_FromString or
> PyUnicodeUCS4_FromString.
>
> Remove the macro and name the function PyUnicode_FromString regardless of
> which internal representation is being used.  The vast majority of binary
> eggs will then work correctly on both UCS2 and UCS4 Pythons.
>
> Functions that explicitly use Py_UNICODE or PyUnicodeObject as part of their
> signature will continue to be #ifdef'ed, so extension modules that *do* care
> about the internal representation will still generate a link error.
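
To make the quoted renaming mechanism concrete, here is a minimal sketch in
plain C.  It does not include Python.h; every Demo* / DEMO_* name is a
hypothetical stand-in, and the switch DEMO_UNICODE_WIDE is an assumption
playing the role of the UCS2/UCS4 build choice.  It only shows how one source
level name can end up as two different link-time symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical build-time switch standing in for the UCS2/UCS4 choice. */
#ifdef DEMO_UNICODE_WIDE
#  define Demo_FromString DemoUCS4_FromString   /* 4-byte build */
#else
#  define Demo_FromString DemoUCS2_FromString   /* 2-byte build */
#endif

/* Two-step stringification so the macro is expanded before it is quoted. */
#define DEMO_STR2(x) #x
#define DEMO_STR(x)  DEMO_STR2(x)

/* The "runtime" defines only the variant that matches its own build.
 * After preprocessing, this function is named DemoUCS2_FromString or
 * DemoUCS4_FromString, never Demo_FromString. */
size_t Demo_FromString(const char *u)
{
    assert(u != NULL);
    return strlen(u);   /* placeholder body; a real runtime would build an object */
}

/* The symbol name that a caller of Demo_FromString is compiled against. */
const char *demo_linked_symbol(void)
{
    return DEMO_STR(Demo_FromString);
}
```

Compiled without -DDEMO_UNICODE_WIDE, demo_linked_symbol() reports the UCS2
flavoured name; compiled with it, the UCS4 one.  An extension built one way
and loaded into a runtime built the other way would look for a symbol that
was never defined, which is the shape of the "undefined symbol" error quoted
above.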

IIUC your proposal doesn't get rid of the root of the problem (that
there are two incompatible choices for Unicode string representation)
but only proposes that there be a purely "abstract" API for working
with string objects, which, if used religiously by extension modules,
would allow them to be linked with either family of runtimes.

This sounds attractive, but I kind of doubt that changing a single API
is sufficient. Perhaps it would be useful to do a kind of review or
survey of how many Unicode APIs are used by the typical extension?
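
As a rough starting point for such a survey, something like the following
could classify symbol names taken from an extension's list of undefined
symbols (obtained with whatever symbol-listing tool the platform provides).
The function names here are hypothetical helpers, not part of any existing
API; only the PyUnicodeUCS2_ and PyUnicodeUCS4_ prefixes come from the
discussion above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 2 or 4 if the symbol name carries a width-specific prefix,
 * and 0 if the name does not depend on the Unicode representation. */
int unicode_abi_of_symbol(const char *name)
{
    static const char ucs2[] = "PyUnicodeUCS2_";
    static const char ucs4[] = "PyUnicodeUCS4_";

    assert(name != NULL);
    if (strncmp(name, ucs2, sizeof ucs2 - 1) == 0)
        return 2;
    if (strncmp(name, ucs4, sizeof ucs4 - 1) == 0)
        return 4;
    return 0;
}

/* Counts how many of the n symbol names are width-specific. */
size_t count_width_specific(const char *const *names, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (unicode_abi_of_symbol(names[i]) != 0)
            hits++;
    }
    return hits;
}
```

Running this over the undefined symbols of a sample of extensions would give
a count of width-specific references per module, though it says nothing about
which of those calls actually touch Py_UNICODE buffers.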

-- 
--Guido van Rossum (python.org/~guido)