[Python-Dev] Unicode <--> UTF-8 in CPython extension modules

John Dennis jdennis at redhat.com
Sat Feb 23 01:50:32 CET 2008

Colin Walters wrote:
> On Fri, Feb 22, 2008 at 4:23 PM, John Dennis <jdennis at redhat.com> wrote:
>>  Python programs which use Unicode string objects for their i18n and
>>  which "link" to C libraries expecting UTF-8, but whose CPython
>>  binding only uses the 's' or 's#' formats, often seem to
>>  fail with encoding errors.
> One thing to be aware of is that PyGTK actually sets Python's default
> encoding to UTF-8.
> http://bugzilla.gnome.org/show_bug.cgi?id=132040
> I mention this because PyGTK is a very popular library related to
> Python and Linux.  So currently if you "import gtk", then libraries
> which are using UTF-8 (as you say, the vast majority) will work with
> Python unicode objects unmodified.

Thank you Colin, your input was very helpful. The fact that PyGTK's i18n
handling worked was the counterexample that made me doubt my analysis,
but I can see from the GNOME bug report and Martin's subsequent comment
that the analysis was sound. It had perplexed me enormously why i18n
handling worked in some circumstances but failed in others. Apparently
the difference was a side effect of importing gtk, a problem made harder
to diagnose when either the order of imports or the complete set of
imports is not taken into account.

I am aware of other Python bindings (libxml2 is one example) which share
the same mistake of not using the 'es' family of format conversions when
the underlying library expects UTF-8. At least I now understand why
incorrectly coded bindings produced correct results in some
circumstances when logic dictated they shouldn't.
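
To make the difference concrete, here is a rough Python-level model of the
two conversion strategies. It is only a sketch, not the C API itself: the
helper names convert_like_s and convert_like_es are hypothetical, and the
"ascii" default is an assumption that mirrors a stock Python 2 interpreter.
In a real binding the fix would be along the lines of
PyArg_ParseTuple(args, "es", "utf-8", &buf), followed by PyMem_Free(buf).

```python
# Python-level model of the two conversion strategies; not the C API itself.
# Helper names and the "ascii" default are illustrative assumptions.


def convert_like_s(text, default_encoding="ascii"):
    """Model of 's'/'s#': encode with the interpreter-wide default encoding."""
    return text.encode(default_encoding)


def convert_like_es(text, encoding="utf-8"):
    """Model of 'es'/'es#': encode with the encoding the binding itself names."""
    return text.encode(encoding)


name = "Gr\u00fc\u00dfe"  # sample non-ASCII text

# With an ASCII default encoding, the 's'-style conversion fails.
try:
    convert_like_s(name)
    s_failed = False
except UnicodeEncodeError:
    s_failed = True

# If some earlier import has switched the default to UTF-8, the same
# conversion silently starts working.
s_after_gtk = convert_like_s(name, default_encoding="utf-8")

# The 'es'-style conversion does not depend on the default at all.
es_result = convert_like_es(name)
```

The point of the model is that the 's'-style result depends on
process-wide state that any imported module may have changed, whereas the
'es'-style result depends only on what the binding asks for.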

John Dennis <jdennis at redhat.com>
