[Python-Dev] Unicode <--> UTF-8 in CPython extension modules
jdennis at redhat.com
Fri Feb 22 22:23:58 CET 2008
I've uncovered what seems to me to be a problem with Python Unicode
string objects passed to extension modules. Or perhaps it's revealing
a misunderstanding on my part :-) So I would like to get some
Extension modules written in C receive strings from Python via the
PyArg_ParseTuple family. Most extension modules use the 's' or 's#'
Many C libraries in Linux use the UTF-8 encoding.
The 's' format when passed a Unicode object will encode the string
according to the default encoding which is immutably set to 'ascii' in
site.py. Thus a C library expecting UTF-8 which uses the 's' format in
PyArg_ParseTuple will get an encoding error when passed a Unicode
string which contains any code points outside the ascii range.
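The failure described above can be reproduced from pure Python by doing
explicitly what the 's' format does implicitly. This is only a sketch: the
helper name encode_like_s is hypothetical, and it models the conversion with
an explicit encode() call rather than going through PyArg_ParseTuple.

```python
def encode_like_s(text, default_encoding="ascii"):
    """Hypothetical model of 's': encode text with the default encoding."""
    return text.encode(default_encoding)


ascii_only = u"hello"
non_ascii = u"caf\u00e9"  # contains U+00E9, which is outside the ASCII range

# Pure ASCII text survives the implicit conversion.
print(encode_like_s(ascii_only))

# Any code point outside ASCII triggers the encoding error.
try:
    encode_like_s(non_ascii)
except UnicodeEncodeError as exc:
    print("encoding error:", exc.reason)

# What a C library expecting UTF-8 would actually need to receive.
utf8_bytes = non_ascii.encode("utf-8")
print(utf8_bytes)
```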
Now my questions:
* Is the use of the 's' or 's#' format parameter in an extension
binding expecting UTF-8 fundamentally broken and not expected to
work? Instead should the binding be using a format conversion which
specifies the desired encoding, e.g. 'es' or 'es#'?
* The extension modules could successfully use the 's' or 's#' format
conversion in a UTF-8 environment if the default encoding was
UTF-8. Changing the default encoding to UTF-8 would in one easy
stroke "fix" most extension modules, right? Why is the default
encoding 'ascii' in UTF-8 environments and why is the default
encoding prohibited from being changed from ascii?
* Did Python 2.5 introduce anything which now makes this issue visible
whereas before it was masked by some other behavior?
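To make the first two questions concrete, here is a rough Python model of a
binding that names its encoding explicitly, in the spirit of 'es' or 'es#'.
The helper encode_for_library is hypothetical and only stands in for the
conversion step; it is not how any particular binding is written.

```python
def encode_for_library(text, encoding="utf-8"):
    """Hypothetical stand-in for an 'es'-style conversion with a named encoding."""
    return text.encode(encoding)


name = u"Gr\u00fc\u00dfe"  # five code points, two of them outside ASCII
payload = encode_for_library(name)

# The length a C library sees is in bytes, not in code points.
print(len(name))
print(len(payload))

# The conversion is lossless: decoding gives back the original text.
print(payload.decode("utf-8") == name)
```

With the encoding stated by the binding, the result no longer depends on the
interpreter's default encoding.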
Python programs which use Unicode string objects for their i18n and
which "link" to C libraries expecting UTF-8 through a CPython binding
that only uses the 's' or 's#' formats often seem to fail with
encoding errors. However, I have yet to see a CPython binding which
explicitly defines its encoding requirements. This suggests to me I
either do not understand the issue in its entirety
or many CPython bindings in Linux UTF-8 environments are broken with
respect to their i18n handling and the problem is currently
John Dennis <jdennis at redhat.com>