[Python-Dev] please consider changing --enable-unicode default to ucs4

Zooko O'Whielacronx zookog at gmail.com
Wed Oct 7 19:07:37 CEST 2009


Folks:

I accidentally sent this letter just to MAL when I intended it for
python-dev.  Please read it, as it explains why the issue I'm raising
is not just the "we should switch to ucs4 because it is better" issue
that was previously settled by GvR.  This is a current, practical
problem that is preventing people from distributing and using Python
packages with binary extension modules on Linux.

Regards,

Zooko


---------- Forwarded message ----------
From: Zooko O'Whielacronx <zookog at gmail.com>
Date: Sun, Sep 27, 2009 at 11:43 AM
Subject: Re: [Python-Dev] please consider changing --enable-unicode
default to ucs4
To: "M.-A. Lemburg" <mal at egenix.com>


Folks:

I'm sorry, I think I didn't make my concern clear.  My users, and lots
of other users, are having a problem with incompatibility between
Python binary extension modules.  One way to improve the situation
would be if the Python devs would use their "bully pulpit" -- their
unique position as a source respected by all Linux distributions --
and say "We recommend that Linux distributions use UCS4 for
compatibility with one another".  This would not abrogate anyone's
ability to choose their preferred setting nor, as far as I can tell,
would it interfere with the ongoing development of Python.

Here are the details:

I'm the maintainer of several Python packages.  I work hard to make it
easy for users, even users who don't know anything about Python, to
use my software.  There have been many pain points in this process,
and for about three years now I've spent a lot of time working on
packaging, including tools such as setuptools, distutils, and the new
"distribute" tool.  Python packaging has been improving during these
years -- things are looking up.

One of the remaining pain points is that I can distribute binaries of
my Python extension modules for Windows or Mac, but if I distribute a
binary Python extension module on Linux, then if the user has a
different UCS2/UCS4 setting then they won't be able to use the
extension module.  The current de facto standard for Linux is UCS4 --
it is used by Debian, Ubuntu, Fedora, RHEL, OpenSuSE, etc.  The
vast majority of Linux users in practice have UCS4, and most binary
Python modules are compiled for UCS4.

That means that a few folks will get left out.  Those folks, from my
experience, are people who built their python executable themselves
without specifying an override for the default, and the smaller Linux
distributions who insist on doing whatever upstream Python devs
recommend instead of doing whatever the other Linux distros are doing.
One of the data points that I reported was a Python interpreter that
was built locally on an Ubuntu server.  Since the person building it
didn't know to override the default setting of --enable-unicode, he
ended up with a Python interpreter built for UCS2, even though all the
Python extension modules shipped by Ubuntu were built with UCS4.
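
A minimal sketch of how someone could check which setting a given
interpreter was built with.  It relies on sys.maxunicode, which is
0xFFFF on a UCS2 build and 0x10FFFF on a UCS4 build; the helper name
unicode_build_width is made up for this illustration.

```python
import sys


def unicode_build_width(maxunicode=None):
    """Return 'ucs2' or 'ucs4' for a given sys.maxunicode value.

    With no argument, the running interpreter is inspected.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # UCS2 (narrow) builds report 0xFFFF; UCS4 (wide) builds report 0x10FFFF.
    return "ucs2" if maxunicode == 0xFFFF else "ucs4"


if __name__ == "__main__":
    print(unicode_build_width())
```

Running this with both the system python and a locally built python
would show whether the two disagree.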

These are not isolated incidents.  The following google searches
suggest that a number of people spend time trying to figure out why
Python extension modules fail on their Linux systems:

http://www.google.com/search?q=PyUnicodeUCS4_FromUnicode+undefined+symbol
http://www.google.com/search?q=+PyUnicodeUCS2_FromUnicode+undefined+symbol
http://www.google.com/search?q=_PyUnicodeUCS2_AsDefaultEncodedString+undefined+symbol
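
The symbol names in those searches carry the setting the extension
module was compiled for, so the error message itself can be decoded.
A rough sketch follows; the function name and the sample message are
illustrative assumptions, not output copied from a real report.

```python
import re

# The Unicode C API symbols embed the build width in their names, e.g.
# PyUnicodeUCS4_FromUnicode or _PyUnicodeUCS2_AsDefaultEncodedString.
_WIDTH_RE = re.compile(r"PyUnicode(UCS[24])_")


def extension_width_from_error(message):
    """Return 'ucs2' or 'ucs4' if the message names such a symbol, else None.

    The result is the width the extension module expects; an
    "undefined symbol" error means the interpreter was built with the
    other width.
    """
    match = _WIDTH_RE.search(message)
    if match is None:
        return None
    return match.group(1).lower()


# Illustrative message text only.
sample = "ImportError: undefined symbol: PyUnicodeUCS4_FromUnicode"
print(extension_width_from_error(sample))  # prints: ucs4
```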

Another data point is the Mandriva Linux distribution.  It is probably
much smaller than Debian, Ubuntu, or RedHat, but it is still one of
the major, well-known distributions.  I asked the Python maintainer
for Mandriva, Michael Scherer, to switch from UCS2 to UCS4 in order
to reduce compatibility problems like these.  His
answer as I understood it was that it is best to follow the
recommendations of the upstream Python devs by using the default
setting instead of choosing a setting for himself.

(Now we could implement a protocol which would show whether a given
Python package was compiled for UCS2 or UCS4.  That would be good.
Hopefully it would make incompatibility more explicit and
understandable to users.  Here is the ticket for that work, to which
I am contributing: http://bugs.python.org/setuptools/issue78 .
However, even if we implement that feature in the distribute tool (the
successor to setuptools), users who build their own python or who use
a Linux distribution that follows upstream configuration defaults will
still be unable to use most Python packages with compiled extension
modules.)
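
To make the idea of such a protocol concrete, here is one possible
shape for it, purely as a sketch.  The tag format
("<platform>-ucs2" / "<platform>-ucs4") and all function names are
hypothetical; this is not what the ticket above specifies and not an
existing setuptools or distribute API.

```python
import sys


def unicode_tag(maxunicode=None):
    """Return 'ucs2' or 'ucs4' for the given (or running) interpreter."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return "ucs2" if maxunicode == 0xFFFF else "ucs4"


def tagged_platform(platform, maxunicode=None):
    """Append the Unicode width to a platform string (hypothetical format)."""
    return "%s-%s" % (platform, unicode_tag(maxunicode))


def is_compatible(package_tag, platform, maxunicode=None):
    """True if a binary package's tag matches this interpreter's tag."""
    return package_tag == tagged_platform(platform, maxunicode)


# A UCS4-built package offered to a UCS2 interpreter is rejected up front
# instead of failing later with an undefined symbol error.
print(is_compatible("linux-x86_64-ucs4", "linux-x86_64", 0xFFFF))  # prints: False
```

A tool could use a check like this to refuse a mismatched binary with
a clear message, which is the "more explicit and understandable" part.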

In a message on this thread, MvL wrote:

> "For that reason I think it's also better that the configure script
> continues to default to UTF-16 -- this will give the UTF-16 support
> code the necessary exercise."
>
> This is effectively a BDFL pronouncement. Nothing has changed the
> validity of the premise of the statement, so the conclusion remains
> valid, as well.

My understanding of the earlier thread was that someone suggested that
UCS4 would be technically better and GvR decided that there were
technical reasons to continue actively maintaining the UCS2 code.
This thread is different: I'm saying that users are suffering
packaging problems and asking for help with that.  The way that
python-dev can help is to make it so that people who choose "Whatever
the upstream devs prefer" (--enable-unicode) on Linux get the de facto
standard setting.

In the earlier thread, GvR wrote: "I think we should continue to leave
this up to the distribution. AFAIK many Linux distros already use UCS4
for everything anyway.".  But at least some distributions are asking
the upstream Python devs to choose for them, by leaving the setting at
the default.

Hm, pondering GvR's words: "I think we should continue to leave this
up to the distribution", I have a new proposal: make it so that *on
Linux only* "--enable-unicode" errors out with an error message saying
"please choose either --enable-unicode=ucs2 or --enable-unicode=ucs4.
ucs4 is the most widely used setting on Linux. See
http://python.org/wiki/UCS2_vs_UCS4 for details.".  This would force
those Linux distributions who are *not* currently deciding to decide.
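
The decision logic of this proposal can be sketched as follows.  This
is only an illustration written in Python: the real check would live
in the configure script, the function name is hypothetical, and it
assumes (as described in this letter) that a bare "--enable-unicode"
currently resolves to ucs2.

```python
LINUX_MESSAGE = (
    "please choose either --enable-unicode=ucs2 or --enable-unicode=ucs4. "
    "ucs4 is the most widely used setting on Linux. See "
    "http://python.org/wiki/UCS2_vs_UCS4 for details."
)


def resolve_unicode_option(value, platform):
    """Return 'ucs2' or 'ucs4', or raise ValueError.

    value:    'ucs2', 'ucs4', or 'yes' for a bare --enable-unicode.
    platform: a sys.platform-style string such as 'linux2' or 'darwin'.
    """
    if value in ("ucs2", "ucs4"):
        # An explicit choice is always honoured.
        return value
    if value == "yes":
        if platform.startswith("linux"):
            # Proposed change: on Linux only, force an explicit decision.
            raise ValueError(LINUX_MESSAGE)
        # Assumed current behaviour elsewhere: fall back to ucs2.
        return "ucs2"
    raise ValueError("unrecognized --enable-unicode value: %r" % (value,))
```

Platforms other than Linux keep their present behaviour under this
sketch; only the implicit choice on Linux becomes an error.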

Thank you for your attention.

Regards,

Zooko

