I've uncovered what seems to me to be a problem with Python Unicode string objects passed to extension modules. Or perhaps it's revealing a misunderstanding on my part :-) So I would like to get some clarification.
It seems to me that there is indeed one or more misunderstandings on your part. Please discuss them on comp.lang.python.
Extension modules written in C receive strings from Python via the PyArg_ParseTuple family. Most extension modules use the 's' or 's#' format code.
Many C libraries in Linux use the UTF-8 encoding.
When passed a Unicode object, the 's' format encodes the string with the default encoding, which is immutably set to 'ascii' in site.py. Thus a binding for a C library expecting UTF-8 which uses the 's' format in PyArg_ParseTuple will raise a UnicodeEncodeError whenever it is passed a Unicode string containing any code points outside the ASCII range.
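The failure mode can be sketched as follows. This is Python 3 syntax for the sake of runnability; in Python 2.5 the 'ascii' conversion happened implicitly inside PyArg_ParseTuple rather than via an explicit encode() call:

```python
# A string containing U+00EF, which lies outside the ASCII range.
text = "na\u00efve"

try:
    # This is, in effect, what the 's' format attempted with the
    # default 'ascii' encoding.
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("implicit conversion fails:", exc.reason)

# An explicit UTF-8 encoding is what a UTF-8-based C library actually needs.
data = text.encode("utf-8")
print(data)
```

The point is that the error occurs during the implicit conversion, before the C library ever sees the data.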
The C library isn't using the 's' format; the Python extension module wrapping the C library is. So whatever conversion is necessary should be done by that wrapping module.
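A minimal sketch of that division of labour, with a hypothetical wrapper around a made-up C-level function (the names set_label and _clib_set_label are illustrative, not from any real library):

```python
def _clib_set_label(raw):
    # Stand-in for the real extension function, which would expect
    # UTF-8 encoded bytes; here it just echoes its argument.
    return raw

def set_label(label):
    """Hypothetical wrapper around a C function expecting UTF-8 bytes.

    The Python layer normalizes its argument explicitly, so the C layer
    never has to guess an encoding for a Unicode object.
    """
    if isinstance(label, str):           # a Unicode string
        label = label.encode("utf-8")    # explicit, not default, encoding
    elif not isinstance(label, bytes):
        raise TypeError("label must be str or bytes")
    return _clib_set_label(label)
```

With this arrangement the binding works regardless of the interpreter's default encoding, because no implicit conversion is ever triggered.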
Now my questions:
* Is the use of the 's' or 's#' format in an extension binding expecting UTF-8 fundamentally broken and not expected to work? Should the binding instead use a format code which specifies the desired encoding, e.g. 'es' or 'es#'?
Yes. Alternatively, require the callers to pass UTF-8 byte strings, not Unicode strings.
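The alternative mentioned here, requiring callers to pass UTF-8 byte strings, can be sketched with an illustrative validation helper (require_utf8_bytes is a made-up name, not part of any API):

```python
def require_utf8_bytes(arg):
    """Accept only byte strings that are valid UTF-8, as a binding
    might if it refuses Unicode objects outright and pushes the
    encoding decision onto its callers."""
    if not isinstance(arg, bytes):
        raise TypeError("expected a UTF-8 encoded byte string")
    arg.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    return arg
```

This mirrors what a C binding gets for free when it only accepts byte strings: the caller, who knows the intended encoding, does the conversion.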
* The extension modules could successfully use the 's' or 's#' format conversion in a UTF-8 environment if the default encoding were UTF-8. Changing the default encoding to UTF-8 would, in one easy stroke, "fix" most extension modules, right?
Wrong. This assumes that "most" libraries do indeed specify their APIs in terms of UTF-8. I don't think that is a fact; not in the world of 2008.
Why is the default encoding 'ascii' even in UTF-8 environments, and why is changing the default encoding away from 'ascii' prohibited?
There are several reasons, all off-topic for python-dev. ASCII was considered the safest assumption: when converting between byte strings and Unicode strings in the absence of an encoding specification, you can't assume anything but ASCII (technically not even that, as the bytes may be EBCDIC, but ASCII is safe for the majority of systems, unlike UTF-8). The default encoding can't be changed because that would break hash(): byte strings and Unicode strings that compare equal under the default encoding must hash equal, and hashes computed before the change would no longer satisfy that invariant.
* Did Python 2.5 introduce anything which now makes this issue visible whereas before it was masked by some other behavior?
I don't know. Can you please be a bit more specific (on comp.lang.python) about where you suspect such a change?

Regards,
Martin