I accidentally sent this letter just to MAL when I intended it for python-dev. Please read it, as it explains why the issue I'm raising is not just the "we should switch to ucs4 because it is better" issue that was previously settled by GvR. This is a current, practical problem that is preventing people from distributing and using Python packages with binary extension modules on Linux.
---------- Forwarded message ----------
From: Zooko O'Whielacronx firstname.lastname@example.org
Date: Sun, Sep 27, 2009 at 11:43 AM
Subject: Re: [Python-Dev] please consider changing --enable-unicode default to ucs4
To: "M.-A. Lemburg" email@example.com
I'm sorry, I think I didn't make my concern clear. My users, and lots of other users, are having a problem with incompatibility between Python binary extension modules. One way to improve the situation would be if the Python devs would use their "bully pulpit" -- their unique position as a source respected by all Linux distributions -- and say "We recommend that Linux distributions use UCS4 for compatibility with one another". This would not abrogate anyone's ability to choose their preferred setting nor, as far as I can tell, would it interfere with the ongoing development of Python.
Here are the details:
I'm the maintainer of several Python packages. I work hard to make it easy for users, even users who don't know anything about Python, to use my software. There have been many pain points in this process, and I've spent about three years now working on packaging, including tools such as setuptools, distutils, and the new "distribute" tool. Python packaging has been improving during these years -- things are looking up.
One of the remaining pain points is that I can distribute binaries of my Python extension modules for Windows or Mac, but if I distribute a binary Python extension module on Linux, any user whose Python interpreter has a different UCS2/UCS4 setting won't be able to use it. The current de facto standard for Linux is UCS4 -- it is used by Debian, Ubuntu, Fedora, RHEL, OpenSuSE, etc. The vast majority of Linux users in practice have UCS4, and most binary Python modules are compiled for UCS4.
That means that a few folks will get left out. Those folks, from my experience, are people who built their python executable themselves without specifying an override for the default, and the smaller Linux distributions who insist on doing whatever upstream Python devs recommend instead of doing whatever the other Linux distros are doing. One of the data points that I reported was a Python interpreter that was built locally on an Ubuntu server. Since the person building it didn't know to override the default setting of --enable-unicode, he ended up with a Python interpreter built for UCS2, even though all the Python extension modules shipped by Ubuntu were built with UCS4.
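As an illustration of how someone in that situation could find out which setting a given interpreter was built with, here is a minimal sketch. It relies on `sys.maxunicode`, which reports 0xFFFF on a UCS2 build and 0x10FFFF on a UCS4 build; the function name `unicode_build` is a hypothetical helper, not part of Python.

```python
import sys


def unicode_build(maxunicode=sys.maxunicode):
    """Return 'ucs2' for a narrow build and 'ucs4' for a wide build.

    The argument defaults to the running interpreter's value and is
    exposed only so the logic can be exercised with other values.
    """
    # A UCS2 (narrow) build reports 0xFFFF; a UCS4 (wide) build reports
    # 0x10FFFF. Python 3.3 and later always report 0x10FFFF.
    return "ucs2" if maxunicode == 0xFFFF else "ucs4"


if __name__ == "__main__":
    print(unicode_build())
```

Running this with the locally built interpreter and again with the distribution's interpreter would show whether the two disagree.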
These are not isolated incidents. The following google searches suggest that a number of people spend time trying to figure out why Python extension modules fail on their linux systems:
http://www.google.com/search?q=PyUnicodeUCS4_FromUnicode+undefined+symbol
http://www.google.com/search?q=+PyUnicodeUCS2_FromUnicode+undefined+symbol
http://www.google.com/search?q=_PyUnicodeUCS2_AsDefaultEncodedString+undefin...
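The symbol names in those searches encode the setting the extension module was compiled for, so the failure can be recognized mechanically. The following is only a sketch: the helper name `extension_unicode_width` is hypothetical, and the sample error text in the usage note is illustrative rather than copied from a real report.

```python
import re

# Matches symbols such as PyUnicodeUCS4_FromUnicode or
# _PyUnicodeUCS2_AsDefaultEncodedString and captures "UCS2" or "UCS4".
_UNICODE_SYMBOL = re.compile(r"\b_?PyUnicode(UCS[24])_\w+")


def extension_unicode_width(error_text):
    """Return 'ucs2' or 'ucs4' if the error names such a symbol, else None.

    The result is the setting the extension module expects, which is the
    opposite of what the interpreter that failed to load it provides.
    """
    match = _UNICODE_SYMBOL.search(error_text)
    if match is None:
        return None
    return match.group(1).lower()
```

For example, passing the text "undefined symbol: PyUnicodeUCS4_FromUnicode" yields "ucs4", meaning the module was built for UCS4 and is being loaded by a UCS2 interpreter.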
Another data point is the Mandriva Linux distribution. It is probably much smaller than Debian, Ubuntu, or RedHat, but it is still one of the major, well-known distributions. I requested of the Python maintainer for Mandriva, Michael Scherer, that they switch from UCS2 to UCS4 in order to reduce compatibility problems like these. His answer as I understood it was that it is best to follow the recommendations of the upstream Python devs by using the default setting instead of choosing a setting for himself.
(Now we could implement a protocol which would show whether a given Python package was compiled for UCS2 or UCS4. That would be good. Hopefully it would make incompatibility more explicit and understandable to users. Here is a ticket for that, in a project I am contributing to: http://bugs.python.org/setuptools/issue78 . However, even if we implement that feature in the distribute tool (the successor to setuptools), users who build their own python or who use a Linux distribution that follows upstream configuration defaults will still be unable to use most Python packages with compiled extension modules.)
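To make the idea of such a protocol concrete, here is a rough sketch of one possible shape: the package carries a tag that includes the Unicode setting, and the installer compares it with the interpreter's tag. The tag format and the function names `build_tag` and `is_compatible` are assumptions made for illustration; they are not what the ticket specifies and not an existing setuptools or distribute API.

```python
def build_tag(python_version, unicode_width):
    """Compose a hypothetical compatibility tag such as 'py2.6-ucs4'."""
    if unicode_width not in ("ucs2", "ucs4"):
        raise ValueError("unicode_width must be 'ucs2' or 'ucs4'")
    return "py%s-%s" % (python_version, unicode_width)


def is_compatible(package_tag, interpreter_tag):
    """A binary package is usable only when both tags match exactly."""
    return package_tag == interpreter_tag


if __name__ == "__main__":
    package = build_tag("2.6", "ucs4")
    interpreter = build_tag("2.6", "ucs2")
    if not is_compatible(package, interpreter):
        print("package built for %s, interpreter is %s" % (package, interpreter))
```

A check like this would turn an obscure undefined-symbol failure at import time into an explicit message at install time, but it would not make the mismatched package usable.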
In a message on this thread, MvL wrote:
"For that reason I think it's also better that the configure script continues to default to UTF-16 -- this will give the UTF-16 support code the necessary exercise."
This is effectively a BDFL pronouncement. Nothing has changed the validity of the premise of the statement, so the conclusion remains valid, as well.
My understanding of the earlier thread was that someone suggested that UCS4 would be technically better and GvR decided that there were technical reasons to continue actively maintaining the UCS2 code. This thread is different: I'm saying that users are suffering packaging problems and asking for help with that. The way that python-dev can help is to make it so that people who choose "Whatever the upstream devs prefer" (--enable-unicode) on Linux get the de facto standard setting.
In the earlier thread, GvR wrote: "I think we should continue to leave this up to the distribution. AFAIK many Linux distros already use UCS4 for everything anyway." But at least some distributions are asking the upstream Python devs to choose for them, by leaving the setting at the default.
Hm, pondering GvR's words: "I think we should continue to leave this up to the distribution", I have a new proposal: make it so that, on Linux only, a bare "--enable-unicode" with no value errors out with an error message saying "please choose either --enable-unicode=ucs2 or --enable-unicode=ucs4. ucs4 is the most widely used setting on Linux. See http://python.org/wiki/UCS2_vs_UCS4 for details." This would force those Linux distributions who are not currently deciding to decide.
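The decision logic of that proposal can be modelled in a few lines. This is a sketch of the proposed behaviour only, not of the real configure script: the names `ConfigureError` and `resolve_enable_unicode` are hypothetical, and the fallback to ucs2 on other platforms simply mirrors the current default described earlier in the thread.

```python
class ConfigureError(Exception):
    """Stands in for configure aborting with an error message."""


def resolve_enable_unicode(value, platform):
    """Model the proposal for --enable-unicode.

    value is None for a bare --enable-unicode, or 'ucs2' / 'ucs4' when
    given explicitly. platform is a string such as sys.platform.
    """
    if value in ("ucs2", "ucs4"):
        # An explicit choice is always honoured.
        return value
    if value is not None:
        raise ConfigureError("unknown --enable-unicode value: %r" % (value,))
    if platform.startswith("linux"):
        # Proposed change: on Linux a bare flag is an error.
        raise ConfigureError(
            "please choose either --enable-unicode=ucs2 or "
            "--enable-unicode=ucs4. ucs4 is the most widely used "
            "setting on Linux."
        )
    # Assumption: other platforms keep the existing ucs2 default.
    return "ucs2"
```

Under this model, builds on other platforms and any build that states its choice are unaffected; only a Linux build that leaves the choice to upstream is stopped and asked to decide.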
Thank you for your attention.
Zooko O'Whielacronx <zookog <at> gmail.com> writes:
I accidentally sent this letter just to MAL when I intended it for python-dev. Please read it, as it explains why the issue I'm raising is not just the "we should switch to ucs4 because it is better" issue that was previously settled by GvR.
For what it's worth, with stringbench under py3k, a UCS2 build is roughly 8% faster than a UCS4 build (190 s. total against 206 s.).