[Python-Dev] Unicode <--> UTF-8 in CPython extension modules

John Dennis jdennis at redhat.com
Fri Feb 22 22:23:58 CET 2008

I've uncovered what seems to me to be a problem with Python Unicode
string objects passed to extension modules. Or perhaps it's revealing
a misunderstanding on my part :-) So I would like to get some
clarification.

Extension modules written in C receive strings from Python via the
PyArg_ParseTuple family. Most extension modules use the 's' or 's#'
format parameter.

Many C libraries in Linux use the UTF-8 encoding.

The 's' format when passed a Unicode object will encode the string
according to the default encoding which is immutably set to 'ascii' in
site.py. Thus a C library expecting UTF-8 which uses the 's' format in
PyArg_ParseTuple will get an encoding error when passed a Unicode
string which contains any code points outside the ascii range.
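
A minimal Python-level sketch of that failure, not the C conversion
itself. The helper convert_like_s and the constant DEFAULT_ENCODING are
hypothetical names introduced only for illustration; the constant stands
in for the interpreter's default encoding as described above.

```python
# ASSUMPTION: this imitates the implicit conversion in pure Python.
# DEFAULT_ENCODING stands in for the default encoding ('ascii' here).
DEFAULT_ENCODING = "ascii"


def convert_like_s(text, encoding=DEFAULT_ENCODING):
    """Hypothetical helper: encode text with the assumed default encoding."""
    return text.encode(encoding)


ascii_only = u"hello"
accented = u"caf\u00e9"  # contains U+00E9, which is outside the ASCII range

# Pure-ASCII text converts without trouble.
ascii_bytes = convert_like_s(ascii_only)

# Text with a non-ASCII code point raises an encoding error.
try:
    convert_like_s(accented)
    failed = False
except UnicodeEncodeError:
    failed = True
```

Here ascii_bytes holds the encoded text, while failed ends up True
because the accented string cannot be represented in ASCII.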

Now my questions:

* Is the use of the 's' or 's#' format parameter in an extension
   binding expecting UTF-8 fundamentally broken and not expected to
   work?  Instead should the binding be using a format conversion which
   specifies the desired encoding, e.g. 'es' or 'es#'?

* The extension modules could successfully use the 's' or 's#' format
   conversion in a UTF-8 environment if the default encoding was
   UTF-8. Changing the default encoding to UTF-8 would in one easy
   stroke "fix" most extension modules, right? Why is the default
   encoding 'ascii' in UTF-8 environments and why is the default
   encoding prohibited from being changed from ascii?

* Did Python 2.5 introduce anything which now makes this issue visible
   whereas before it was masked by some other behavior?
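
The idea behind the first question, naming the encoding instead of
relying on the default, can be sketched on the Python side of a binding.
This is only an illustration under assumptions: encode_for_c_library is
a hypothetical helper, and it does not reproduce the exact semantics of
the 'es' or 'es#' converters.

```python
def encode_for_c_library(value, encoding="utf-8"):
    """Hypothetical helper: return bytes in the encoding a C library expects.

    ASSUMPTION: byte strings are passed through unchanged and are trusted
    to already be in the right encoding; this sketch does not verify that.
    """
    if isinstance(value, bytes):
        return value
    # Unicode text is encoded with the explicitly named encoding, so the
    # result does not depend on the interpreter's default encoding.
    return value.encode(encoding)


accented = u"caf\u00e9"
utf8_bytes = encode_for_c_library(accented)
```

A caller would hand utf8_bytes to the binding, so the implicit
conversion with the default encoding is never triggered for that
argument.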


Python programs which use Unicode string objects for their i18n and
which "link" to C libraries expecting UTF-8 through a CPython binding
that only uses the 's' or 's#' formats often seem to fail with
encoding errors. However, I have yet to see a CPython binding which
explicitly defines its encoding requirements. This suggests to me that
I either do not understand the issue in its entirety or that many
CPython bindings in Linux UTF-8 environments are broken with respect
to their i18n handling and the problem is currently not addressed.

John Dennis <jdennis at redhat.com>
