Unicode <--> UTF-8 in CPython extension modules
I've uncovered what seems to me to be a problem with Python Unicode
string objects passed to extension modules. Or perhaps it's revealing
a misunderstanding on my part :-) So I would like to get some
clarification.
Extension modules written in C receive strings from python via the
PyArg_ParseTuple family. Most extension modules use the 's' or 's#'
format parameter.
Many C libraries in Linux use the UTF-8 encoding.
The 's' format, when passed a Unicode object, will encode the string
according to the default encoding, which is immutably set to 'ascii' in
site.py. Thus a C library expecting UTF-8, whose binding uses the 's'
format in PyArg_ParseTuple, will get an encoding error when passed a
Unicode string which contains any code points outside the ASCII range.
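The failure mode described above can be sketched in a few lines. This is a minimal illustration in modern Python (the thread itself concerns Python 2, where the encode happened implicitly inside PyArg_ParseTuple): an encode under the 'ascii' codec fails as soon as the text contains a code point outside the ASCII range, while an explicit UTF-8 encode succeeds.

```python
# Sketch of the failure the 's' format triggers: an implicit encode
# under the 'ascii' default codec fails on any non-ASCII code point.
text = "caf\u00e9"  # 'café' -- U+00E9 is outside the ASCII range

try:
    text.encode("ascii")  # roughly what 's' did under the 'ascii' default
except UnicodeEncodeError as exc:
    print("ascii encode failed:", exc.reason)

# An explicit UTF-8 encode, by contrast, succeeds:
print(text.encode("utf-8"))  # b'caf\xc3\xa9'
```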
Now my questions:
* Is the use of the 's' or 's#' format parameter in an extension
binding expecting UTF-8 fundamentally broken and not expected to
work? Should the binding instead be using a format conversion which
specifies the desired encoding, e.g. 'es' or 'es#'?
* The extension modules could successfully use the 's' or 's#' format
conversion in a UTF-8 environment if the default encoding were
UTF-8. Changing the default encoding to UTF-8 would in one easy
stroke "fix" most extension modules, right? Why is the default
encoding 'ascii' in UTF-8 environments, and why is the default
encoding prohibited from being changed from ascii?
* Did Python 2.5 introduce anything which now makes this issue visible
whereas before it was masked by some other behavior?
Summary:
Python programs which use Unicode string objects for their i18n, and
which "link" to C libraries expecting UTF-8 but whose CPython
binding only uses the 's' or 's#' formats, seem to often
fail with encoding errors. However, I have yet to see a CPython
binding which explicitly defines its encoding requirements. This
suggests to me that either I do not understand the issue in its entirety,
or many CPython bindings in Linux UTF-8 environments are broken with
respect to their i18n handling and the problem is currently
not addressed.
--
John Dennis
I've uncovered what seems to me to be a problem with Python Unicode string objects passed to extension modules. Or perhaps it's revealing a misunderstanding on my part :-) So I would like to get some clarification.
It seems to me that there is indeed one or more misunderstandings on your part. Please discuss them on comp.lang.python.
Extension modules written in C receive strings from python via the PyArg_ParseTuple family. Most extension modules use the 's' or 's#' format parameter.
Many C libraries in Linux use the UTF-8 encoding.
The 's' format, when passed a Unicode object, will encode the string according to the default encoding, which is immutably set to 'ascii' in site.py. Thus a C library expecting UTF-8, whose binding uses the 's' format in PyArg_ParseTuple, will get an encoding error when passed a Unicode string which contains any code points outside the ASCII range.
The C library isn't the one using the 's' format. A Python module wrapping the C library is. So whatever conversion is necessary should be done by that Python module.
Now my questions:
* Is the use of the 's' or 's#' format parameter in an extension binding expecting UTF-8 fundamentally broken and not expected to work? Should the binding instead be using a format conversion which specifies the desired encoding, e.g. 'es' or 'es#'?
Yes. Alternatively, require the callers to pass UTF-8 byte strings, not Unicode strings.
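Martin's alternative, requiring callers to pass UTF-8 byte strings, amounts to moving the conversion to the call site. A minimal sketch, where `c_library_call` is a hypothetical stand-in for a binding that uses 's'/'s#' and hands the bytes straight to C:

```python
# Hypothetical stand-in for an extension function that expects UTF-8
# byte strings; a real binding would pass `data` on to C code unchanged.
def c_library_call(data):
    assert isinstance(data, bytes)
    return data

text = "Gr\u00fc\u00dfe"                       # 'Grüße'
result = c_library_call(text.encode("utf-8"))  # explicit conversion at the call site
```

With the encode made explicit, no process-wide default encoding is involved, so the call behaves the same regardless of what site.py set.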
* The extension modules could successfully use the 's' or 's#' format conversion in a UTF-8 environment if the default encoding were UTF-8. Changing the default encoding to UTF-8 would in one easy stroke "fix" most extension modules, right?
Wrong. This assumes that "most" libraries do indeed specify their APIs in terms of UTF-8. I don't think that is a fact; not in the world of 2008.
Why is the default encoding 'ascii' in UTF-8 environments and why is the default encoding prohibited from being changed from ascii?
There are several reasons, all off-topic for python-dev. ASCII was considered the safest assumption: when converting between byte and Unicode strings in the absence of an encoding specification, you can't assume anything but ASCII (technically, not even that, as the bytes may be EBCDIC, but ASCII is safe for the majority of systems, unlike UTF-8). The encoding can't be changed because that would break hash().
* Did Python 2.5 introduce anything which now makes this issue visible whereas before it was masked by some other behavior?
I don't know. Can you please be a bit more specific (on comp.lang.python) where you suspect such a change? Regards, Martin
On Fri, Feb 22, 2008 at 4:23 PM, John Dennis wrote:
Python programs which use Unicode string objects for their i18n, and which "link" to C libraries expecting UTF-8 but whose CPython binding only uses the 's' or 's#' formats, seem to often fail with encoding errors.
One thing to be aware of is that PyGTK+ actually sets the Python Unicode object encoding to UTF-8. http://bugzilla.gnome.org/show_bug.cgi?id=132040 I mention this because PyGTK is a very popular library related to Python and Linux. So currently if you "import gtk", then libraries which are using UTF-8 (as you say, the vast majority) will work with Python unicode objects unmodified.
On 2008-02-23 00:46, Colin Walters wrote:
On Fri, Feb 22, 2008 at 4:23 PM, John Dennis
wrote: Python programs which use Unicode string objects for their i18n, and which "link" to C libraries expecting UTF-8 but whose CPython binding only uses the 's' or 's#' formats, seem to often fail with encoding errors.
One thing to be aware of is that PyGTK+ actually sets the Python Unicode object encoding to UTF-8.
http://bugzilla.gnome.org/show_bug.cgi?id=132040
I mention this because PyGTK is a very popular library related to Python and Linux. So currently if you "import gtk", then libraries which are using UTF-8 (as you say, the vast majority) will work with Python unicode objects unmodified.
Are you suggesting that John should rely on a bug in some 3rd party extension instead of fixing the Python extension to use "es#" where needed? There's a good reason why we don't allow setting the default encoding outside site.py. Trying to play tricks to change the default encoding later on will only cause problems, e.g. the cached default-encoded versions of Unicode objects will then use different encodings: the one set in site.py and, later, the ones with the new encoding. As a result, all kinds of weird things can happen. Using the Python Unicode C API really isn't all that hard, and it's well documented too, so please use it instead of trying to design software based on workarounds. Thanks, -- Marc-Andre Lemburg, eGenix.com (Feb 23 2008)
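What the 'es#' converter (or the equivalent Unicode C API calls) does for a C binding can be sketched at the Python level: normalize the argument to UTF-8 bytes at the wrapper boundary, never relying on the process-wide default encoding. The function name here is illustrative, not a real API.

```python
# Python-level sketch of an 'es#'-style boundary conversion: the
# wrapper names its encoding explicitly and hands only validated
# UTF-8 bytes down to the C library.
def to_utf8(arg):
    if isinstance(arg, bytes):
        arg.decode("utf-8")        # validate; raises if not well-formed UTF-8
        return arg
    return arg.encode("utf-8")     # explicit encoding chosen by the binding

print(to_utf8("na\u00efve"))  # b'na\xc3\xafve'
```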
Colin Walters wrote:
On Fri, Feb 22, 2008 at 4:23 PM, John Dennis
wrote: Python programs which use Unicode string objects for their i18n, and which "link" to C libraries expecting UTF-8 but whose CPython binding only uses the 's' or 's#' formats, seem to often fail with encoding errors.
One thing to be aware of is that PyGTK+ actually sets the Python Unicode object encoding to UTF-8.
http://bugzilla.gnome.org/show_bug.cgi?id=132040
I mention this because PyGTK is a very popular library related to Python and Linux. So currently if you "import gtk", then libraries which are using UTF-8 (as you say, the vast majority) will work with Python unicode objects unmodified.
Thank you Colin, your input was very helpful. The fact that PyGTK's i18n
handling worked was the counter example which made me doubt my analysis
was correct, but I can see from the Gnome bug report and Martin's
subsequent comment that the analysis was sound. It had perplexed me
enormously why i18n handling worked in some circumstances but failed in
others. Apparently it was a side effect of importing gtk, a problem
exacerbated when either the sequence of imports or the complete set of
imports was not taken into account.
I am aware of other python bindings (libxml2 is one example) which share
the same mistake of not using the 'es' family of format conversions when
the underlying library is UTF-8. At least I now understand why
incorrectly coded bindings in some circumstances produced correct
results when logic dictated they shouldn't.
--
John Dennis
participants (4)
- "Martin v. Löwis"
- Colin Walters
- John Dennis
- M.-A. Lemburg