New subject: please consider changing --enable-unicode default to ucs4

Sept. 28, 2009

      Zooko O'Whielacronx wrote:
...
Folks:
I'm sorry, I think I didn't make my concern clear.  My users, and lots
of other users, are having a problem with incompatibility between
Python binary extension modules.  One way to improve the situation
would be if the Python devs would use their "bully pulpit" -- their
unique position as a source respected by all Linux distributions --
and say "We recommend that Linux distributions use UCS4 for
compatibility with one another".  This would not abrogate anyone's
ability to choose their preferred setting nor, as far as I can tell,
would it interfere with the ongoing development of Python.

-1

Please note that we did not choose to ship Python as UCS4 binary
on Linux - the Linux distributions did.

The Python default is UCS2 for a good reason: it's a good trade-off
between memory consumption, functionality and performance.

As already mentioned, I also don't understand how changing
the Python default on Linux would help your users in any way -
if you let distutils compile your extensions, it's automatically
going to use the right Unicode setting for you (as well as your
users).

Unfortunately, this automatic support doesn't help you when
shipping e.g. setuptools eggs, but this is a tool problem,
not one of Python: setuptools completely ignores the fact
that there are two ways to build Python.

I'd suggest you ask the tool maintainers to adjust their tools
to support the Python Unicode option.
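
As a rough illustration of what such tool support could look like,
here is a minimal sketch that appends the Unicode width to a platform
string. The "-ucs2"/"-ucs4" suffix scheme and both function names are
hypothetical; this is not something setuptools or distutils actually
does.

```python
import sys


def unicode_build_suffix(maxunicode=None):
    """Return 'ucs2' or 'ucs4' for a given sys.maxunicode value."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # Narrow (UCS2) builds report 0xFFFF; wide (UCS4) builds report 0x10FFFF.
    return "ucs2" if maxunicode <= 0xFFFF else "ucs4"


def tagged_platform(base_platform, maxunicode=None):
    """Append the Unicode width to a platform string (hypothetical scheme)."""
    return "%s-%s" % (base_platform, unicode_build_suffix(maxunicode))
```

A packaging tool using a tag of this kind could refuse to install a
binary whose tag does not match the running interpreter, instead of
failing later at import time.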
...
Here are the details:
I'm the maintainer of several Python packages.  I work hard to make it
easy for users, even users who don't know anything about Python, to
use my software.  There have been many pain points in this process and
I've spent a lot of time on it for about three years now working on
packaging, including the tools such as setuptools and distutils and
the new "distribute" tool.  Python packaging has been improving during
these years -- things are looking up.
One of the remaining pain points is that I can distribute binaries of
my Python extension modules for Windows or Mac, but if I distribute a
binary Python extension module on Linux, then if the user has a
different UCS2/UCS4 setting then they won't be able to use the
extension module.  The current de facto standard for Linux is UCS4 --
it is used by Debian, Ubuntu, Fedora, RHEL, OpenSuSE, etc.  The
vast majority of Linux users in practice have UCS4, and most binary
Python modules are compiled for UCS4.
That means that a few folks will get left out.  Those folks, from my
experience, are people who built their python executable themselves
without specifying an override for the default, and the smaller Linux
distributions who insist on doing whatever upstream Python devs
recommend instead of doing whatever the other Linux distros are doing.
One of the data points that I reported was a Python interpreter that
was built locally on an Ubuntu server.  Since the person building it
didn't know to override the default setting of --enable-unicode, he
ended up with a Python interpreter built for UCS2, even though all the
Python extension modules shipped by Ubuntu were built with UCS4.

People building their own Python version will usually also build
their own extensions, so I don't really believe that the above
scenario is very common.

Also note that Python will complain loudly when you try to load
a UCS2 extension in a UCS4 build and vice-versa. We've made sure
that any extension using the Python Unicode C API has to be built
for the same UCS version of Python. This is done by using different
names for the C APIs at the C level.
...
These are not isolated incidents.  The following google searches
suggest that a number of people spend time trying to figure out why
Python extension modules fail on their linux systems:
http://www.google.com/search?q=PyUnicodeUCS4_FromUnicode+undefined+symbol
http://www.google.com/search?q=+PyUnicodeUCS2_FromUnicode+undefined+symbol
http://www.google.com/search?q=_PyUnicodeUCS2_AsDefaultEncodedString+undefin...

Perhaps we should add a FAQ entry for these linker errors
(which are caused by the mentioned C API changes to prevent
mixing UCS version) ?!
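
Such a FAQ entry could come with a small helper along these lines. It
is only a sketch: it assumes the error text contains one of the
PyUnicodeUCS2_* / PyUnicodeUCS4_* symbol names shown in the searches
above, and the function names are made up for illustration.

```python
import re
import sys

# Matches the UCS-specific C API names, e.g. PyUnicodeUCS4_FromUnicode.
_UCS_SYMBOL = re.compile(r"PyUnicode(UCS[24])_\w+")


def extension_ucs_build(error_message):
    """Return 'UCS2' or 'UCS4' if the message names such a symbol, else None."""
    match = _UCS_SYMBOL.search(error_message)
    return match.group(1) if match else None


def explain(error_message, interpreter_maxunicode=None):
    """Turn an 'undefined symbol' error into a short explanation."""
    if interpreter_maxunicode is None:
        interpreter_maxunicode = sys.maxunicode
    wanted = extension_ucs_build(error_message)
    if wanted is None:
        return "No UCS2/UCS4 symbol found in the error message."
    running = "UCS2" if interpreter_maxunicode <= 0xFFFF else "UCS4"
    if wanted == running:
        return "Extension and interpreter both use %s." % running
    return ("Extension was built for %s, but this interpreter is %s."
            % (wanted, running))
```

The symbol that the linker cannot find tells you which build the
extension expects; the interpreter that failed to load it is the
other kind.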

Here's a quick way to determine your Python Unicode build type:

python -c "import sys;print((sys.maxunicode<66000)and'UCS2'or'UCS4')"

Perhaps we should include this info as well as a 32/64-bit indicator
and the processor type in the Python startup line:

# python
Python 2.6 (r26:66714, Feb  3 2009, 20:49:49, UCS4, 64-bit, x86_64)
[GCC 4.3.2 [gcc-4_3-branch revision 141291]] on linux2
Type "help", "copyright", "credits" or "license" for more information.

This would help users find the right binaries to install as
extensions.
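
Until something like that is in the startup line, the same three
pieces of information can be collected from the standard library.
This is a minimal sketch; the function name and output format are
only an illustration of the proposal above, not an existing feature.
Note that on Python 3.3 and later the build-time choice no longer
exists and sys.maxunicode is always 0x10FFFF.

```python
import platform
import struct
import sys


def build_summary(maxunicode=None, pointer_bytes=None, machine=None):
    """Return a string such as 'UCS4, 64-bit, x86_64' for this interpreter."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if pointer_bytes is None:
        # Size of a C pointer in bytes: 4 on 32-bit builds, 8 on 64-bit builds.
        pointer_bytes = struct.calcsize("P")
    if machine is None:
        machine = platform.machine()
    ucs = "UCS2" if maxunicode <= 0xFFFF else "UCS4"
    return "%s, %d-bit, %s" % (ucs, pointer_bytes * 8, machine)
```

Calling build_summary() with no arguments describes the running
interpreter; the parameters exist only so the formatting can be
checked with known values.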
...
Another data point is the Mandriva Linux distribution.  It is probably
much smaller than Debian, Ubuntu, or RedHat, but it is still one of
the major, well-known distributions.  I requested of the Python
maintainer for Mandriva, Michael Scherer, that they switch from UCS2
to UCS4 in order to reduce compatibility problems like these.  His
answer as I understood it was that it is best to follow the
recommendations of the upstream Python devs by using the default
setting instead of choosing a setting for himself.

Which is IMHO what all Linux distributions should have done.

Distributions should really not be put in charge of upstream
coding design decisions.

Regards,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Sep 28 2009)
...
...
...
Python/Zope Consulting and Support ...        http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/
