[New-bugs-announce] [issue8654] Improve ABI compatibility between UCS2 and UCS4 builds
report at bugs.python.org
Fri May 7 21:02:45 CEST 2010
New submission from Daniel Stutzbach <daniel at stutzbachenterprises.com>:
Currently, Python can be built with an internal Unicode representation of UCS2 or UCS4. To prevent extension modules compiled with the wrong Unicode representation from linking, unicodeobject.h #defines many of the Unicode functions. For example, PyUnicode_FromString becomes either PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString.
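The renaming mechanism can be illustrated with a minimal, self-contained sketch. All names here (Demo_FromString, DemoUCS2_FromString, DemoUCS4_FromString, DEMO_WIDE_UNICODE) are hypothetical stand-ins and not the actual contents of unicodeobject.h; the point is only that the name written in the source and the name seen by the linker differ.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical build switch; a real build would fix this at configure time. */
#ifdef DEMO_WIDE_UNICODE
#  define Demo_FromString DemoUCS4_FromString
#else
#  define Demo_FromString DemoUCS2_FromString
#endif

/* Two-level stringification so the macro is expanded before being quoted. */
#define DEMO_STR2(x) #x
#define DEMO_STR(x) DEMO_STR2(x)

/* The author writes Demo_FromString, but the preprocessor rewrites it, so the
   compiler emits the definition under the build-specific name. */
int Demo_FromString(const char *s) { return (int)strlen(s); }

/* Returns the symbol name that the linker actually sees. */
const char *demo_linked_symbol(void) { return DEMO_STR(Demo_FromString); }
```

An extension compiled with the other setting would reference DemoUCS4_FromString instead, which this build does not export, so loading it would fail with an undefined-symbol error.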
Consequently, if one installs a binary egg (e.g., with easy_install), there's a good chance one will get an error such as the following when trying to use it:
undefined symbol: PyUnicodeUCS2_FromString
In Python 2, only some extension modules were stung by this problem. For Python 3, virtually every extension type will need to call a PyUnicode_* function, since __repr__ must return a Unicode object. It's basically fruitless to upload a binary egg for Python 3 to PyPI, since it will generate link errors for a large fraction of downloaders (I discovered this the hard way).
Right now, nearly all the functions in unicodeobject.h are wrapped. Several functions are not. Many of the unwrapped functions also have no documentation, so I'm guessing they are newer functions that were not wrapped when they were added.
Most extensions treat PyUnicodeObjects as opaque and do not care if the internal representation is UCS2 or UCS4. We can improve ABI compatibility by only wrapping functions where the representation matters from the caller's point of view.
For example, PyUnicode_FromUnicode creates a Unicode object from an array of Py_UNICODE objects. It will interpret the data differently on UCS2 vs UCS4, so the function should be wrapped.
On the other hand, PyUnicode_FromString creates a Unicode object from a char *. The caller can treat the returned object as opaque, so the function should not be wrapped.
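To see why a function taking an array of code units is not opaque, here is a small sketch that reads one and the same zero-terminated buffer with two different unit widths. The helper demo_count_units is hypothetical and is not part of the Python C API; it only demonstrates that the caller and callee must agree on the unit width.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count the code units before the first all-zero unit, reading 'width'
   bytes (2 or 4) per unit. Hypothetical helper for illustration only. */
size_t demo_count_units(const void *buf, size_t width) {
    const unsigned char *p = (const unsigned char *)buf;
    size_t n = 0;
    for (;;) {
        uint32_t unit = 0;
        memcpy(&unit, p + n * width, width);  /* zero test is endian-neutral */
        if (unit == 0)
            break;
        n++;
    }
    return n;
}
```

For a buffer holding the 2-byte units 'a', 'b', 'c', 'd' followed by two zero units, reading with a width of 2 finds four units, while reading with a width of 4 finds only two. A char * argument has no such ambiguity, which is why a function like PyUnicode_FromString can be left unwrapped.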
The attached patch implements that rule. It unwraps 64 opaque functions that were previously wrapped, and wraps 11 non-opaque functions that were previously unwrapped. "make test" works with both UCS2 and UCS4 builds.
I previously brought this issue up on python-ideas, see:
Here's a summary of that discussion:
Zooko Wilcox-O'Hearn pointed out that my proposal is complementary to his proposal to standardize on UCS4, to reduce the risk of extension modules built with a mismatched encoding.
Stefan Behnel pointed out that easy_install should allow eggs to specify the encoding they require. PJE's proposed implementation of that feature (http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care". My proposal greatly increases the number of eggs that could label themselves "Don't Care", reducing maintenance work for package maintainers. In other words, they are complementary fixes.
Guido liked the idea but expressed concern about the possibility of extension modules that link successfully, but later crash because they actually do depend on the UCS2/UCS4 distinction.
With my current patch, there are still two ways for that to happen:
1) The extension uses only opaque functions, but casts the returned PyObject * to PyUnicodeObject * and accesses the str member, or
2) The extension uses only opaque functions, but uses the PyUnicode_AS_UNICODE or PyUnicode_AS_DATA macros.
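Both cases come down to the extension reading the internal buffer with a unit width fixed at its own compile time. The following is a rough sketch of that failure mode; DemoUnicodeObject and the two reader functions are hypothetical and do not reproduce the real PyUnicodeObject layout or the real macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an object whose buffer holds 4-byte units. */
typedef struct {
    size_t length;
    void *str;
} DemoUnicodeObject;

/* Read unit i the way the 4-byte interpreter itself would. */
uint32_t demo_read_as_4byte(const DemoUnicodeObject *o, size_t i) {
    uint32_t unit;
    memcpy(&unit, (const unsigned char *)o->str + i * sizeof unit, sizeof unit);
    return unit;
}

/* Read unit i the way an extension compiled for 2-byte units would. */
uint32_t demo_read_as_2byte(const DemoUnicodeObject *o, size_t i) {
    uint16_t unit;
    memcpy(&unit, (const unsigned char *)o->str + i * sizeof unit, sizeof unit);
    return unit;
}
```

With a buffer holding the units 0x41 and 0x42, the 4-byte reader returns 0x42 for index 1, while the 2-byte reader lands in the middle of the first unit and returns something else. Nothing in this path touches a wrapped function, so the linker has no chance to reject the mismatch.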
Most packages that poke into the internals of PyUnicodeObject also call non-opaque functions. Consequently, they will still generate a linker error if the encoding is mismatched, as desired.
I'm trying to come up with a way to 100% guarantee that any extension poking into the internals will generate a linker error if the encoding is mismatched, even if they don't call any non-opaque functions. I'll post about that in a separate comment to this bug.
components: Interpreter Core, Unicode
stage: needs patch
title: Improve ABI compatibility between UCS2 and UCS4 builds
versions: Python 3.2
Python tracker <report at bugs.python.org>