Mailman 3 Unicode <--> UTF-8 in CPython extension modules - Python-Dev

22 Feb 2008

      I've uncovered what seems to me to a problem with python Unicode
string objects passed to extension modules. Or perhaps it's revealing
a misunderstanding on my part :-) So I would like to get some
clarification.

Extension modules written in C receive strings from python via the
PyArg_ParseTuple family. Most extension modules use the 's' or 's#'
format parameter.

Many C libraries in Linux use the UTF-8 encoding.

The 's' format when passed a Unicode object will encode the string
according to the default encoding which is immutably set to 'ascii' in
site.py. Thus a C library expecting UTF-8 which uses the 's' format in
PyArg_ParseTuple will get an encoding error when passed a Unicode
string which contains any code points outside the ascii range.

Now my questions:

* Is the use of the 's' or 's*' format parameter in an extension
   binding expecting UTF-8 fundamentally broken and not expected to
   work?  Instead should the binding be using a format conversion which
   specifies the desired encoding, e.g. 'es' or 'es#'?

* The extension modules could successfully use the 's' or 's#' format
   conversion in a UTF-8 environment if the default encoding was
   UTF-8. Changing the default encoding to UTF-8 would in one easy
   stroke "fix" most extension modules, right? Why is the default
   encoding 'ascii' in UTF-8 environments and why is the default
   encoding prohibited from being changed from ascii?

* Did Python 2.5 introduce anything which now makes this issue visible
   whereas before it was masked by some other behavior?

Summary:

Python programs which use Unicode string objects for their i18n and
which "link" to C libraries expecting UTF-8 but which have a CPython
binding which only uses 's' or 's#' formats programs seem to often
fail with encoding errors. However, I have yet to see a CPython
binding which does explicitly define it's encoding requirements. This
suggests to me I either do not understand the issue in it's entirety
or many CPython bindings in Linux UTF-8 environments are broken with
respect to their i18n handling and the problem is currently
not addressed.

-- 
John Dennis 

Unicode <--> UTF-8 in CPython extension modules

John Dennis

"Martin v. Löwis"

Colin Walters

M.-A. Lemburg

John Dennis

tags

participants (4)