I've uncovered what seems to me to be a problem with Python Unicode string objects passed to extension modules. Or perhaps it's revealing a misunderstanding on my part :-) So I would like to get some clarification.
It seems to me that there is indeed one or more misunderstandings on your part. Please discuss them on comp.lang.python.
Extension modules written in C receive strings from Python via the PyArg_ParseTuple family. Most extension modules use the 's' or 's#' format code.
Many C libraries in Linux use the UTF-8 encoding.
When passed a Unicode object, the 's' format encodes the string with the default encoding, which is immutably set to 'ascii' in site.py. Thus a binding for a C library expecting UTF-8 which uses the 's' format in PyArg_ParseTuple will raise a UnicodeEncodeError whenever it is passed a Unicode string containing any code points outside the ASCII range.
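The failure mode can be sketched as follows. This is Python 3 syntax for the sake of runnability; in Python 2.5 the 'ascii' conversion happened implicitly inside PyArg_ParseTuple rather than via an explicit encode() call:

```python
# A string containing U+00EF, which lies outside the ASCII range.
text = "na\u00efve"

try:
    # This is, in effect, what the 's' format attempted with the
    # default 'ascii' encoding.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("implicit conversion fails:", exc.reason)

# An explicit UTF-8 encoding is what a UTF-8-based C library actually needs.
data = text.encode("utf-8")
print(data)
```

The point is that the error occurs during the implicit conversion, before the C library ever sees the data.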
The C library isn't using the 's' format; the Python extension module wrapping the C library is. So whatever conversion is necessary should be done by that wrapping module.
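A minimal sketch of that division of labour, with a hypothetical wrapper around a made-up C-level function (the names set_label and _clib_set_label are illustrative, not from any real library):

```python
def _clib_set_label(raw):
    # Stand-in for the real extension function, which would expect
    # UTF-8 encoded bytes; here it just echoes its argument.
    return raw

def set_label(label):
    """Hypothetical wrapper around a C function expecting UTF-8 bytes.

    The Python layer normalizes its argument explicitly, so the C layer
    never has to guess an encoding for a Unicode object.
    """
    if isinstance(label, str):           # a Unicode string
        label = label.encode("utf-8")    # explicit, not default, encoding
    elif not isinstance(label, bytes):
        raise TypeError("label must be str or bytes")
    return _clib_set_label(label)
```

With this arrangement the binding works regardless of the interpreter's default encoding, because no implicit conversion is ever triggered.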
Now my questions:
* Is the use of the 's' or 's#' format in an extension binding expecting UTF-8 fundamentally broken and not expected to work? Should the binding instead use a format code which specifies the desired encoding, e.g. 'es' or 'es#'?
Yes. Alternatively, require the callers to pass UTF-8 byte strings, not Unicode strings.
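The alternative mentioned here, requiring callers to pass UTF-8 byte strings, can be sketched with an illustrative validation helper (require_utf8_bytes is a made-up name, not part of any API):

```python
def require_utf8_bytes(arg):
    """Accept only byte strings that are valid UTF-8, as a binding
    might if it refuses Unicode objects outright and pushes the
    encoding decision onto its callers."""
    if not isinstance(arg, bytes):
        raise TypeError("expected a UTF-8 encoded byte string")
    arg.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    return arg
```

This mirrors what a C binding gets for free when it only accepts byte strings: the caller, who knows the intended encoding, does the conversion.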
* The extension modules could successfully use the 's' or 's#' format conversion in a UTF-8 environment if the default encoding were UTF-8. Changing the default encoding to UTF-8 would, in one easy stroke, "fix" most extension modules, right?
Wrong. This assumes that "most" libraries do indeed specify their APIs in terms of UTF-8. I don't think that is a fact; not in the world of 2008.
Why is the default encoding 'ascii' even in UTF-8 environments, and why is changing the default encoding away from 'ascii' prohibited?
There are several reasons, all off-topic for python-dev. ASCII was considered the safest assumption: when converting between byte strings and Unicode strings in the absence of an encoding specification, you can't assume anything but ASCII (technically not even that, as the bytes may be EBCDIC, but ASCII is safe for the majority of systems, unlike UTF-8). The default encoding can't be changed because that would break hash(): byte strings and Unicode strings that compare equal under the default encoding must hash equal, and hashes computed before the change would no longer satisfy that invariant.
* Did Python 2.5 introduce anything which now makes this issue visible whereas before it was masked by some other behavior?
I don't know. Can you please be a bit more specific (on comp.lang.python) about where you suspect such a change?

Regards,
Martin