[Python-Dev] please consider changing --enable-unicode default to ucs4

Mon Sep 28 18:12:56 CEST 2009

James Y Knight wrote:
> On Sep 28, 2009, at 4:25 AM, M.-A. Lemburg wrote:
>> Distributions should really not be put in charge of upstream
>> coding design decisions.
> 
> I don't think you can blame distros for this one....
>
> From PEP 0261:
>     It is also proposed that one day --enable-unicode will just
>     default to the width of your platforms wchar_t.
> 
> On linux, wchar_t is 4 bytes.

The PEP also has this to say:

    This has the effect of doubling the size of most Unicode
    strings. In order to avoid imposing this cost on every
    user, Python 2.2 will allow the 4-byte implementation as a
    build-time option. Users can choose whether they care about
    wide characters or prefer to preserve memory.

And that's still true today. It was the main reason for not
making it the default in those days. Today, Python 3.x
uses Unicode for all strings, so while the RAM situation has
changed somewhat since Python 2.2, switching to UCS4 has a much
wider effect on the Python memory footprint than in late 2001.
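
To put a rough number on the doubling, here is a minimal sketch. It
assumes that encoding to fixed-width UTF-16 and UTF-32 code units is a
fair stand-in for the internal character buffers of a UCS2 and a UCS4
build; object header overhead is ignored and the sample text is made up.

```python
# Approximate the character buffer size of one string on a UCS2 build
# versus a UCS4 build by encoding it with fixed-width code units.
text = u"Python-Dev " * 100  # 1100 characters, all inside the BMP

ucs2_bytes = len(text.encode("utf-16-le"))  # 2 bytes per BMP character
ucs4_bytes = len(text.encode("utf-32-le"))  # 4 bytes per character

print(len(text), ucs2_bytes, ucs4_bytes)
```

For text that stays within the BMP, the second figure is exactly twice
the first.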

> If there's a consensus amongst python upstream that all the distros
> should be shipping Python with UCS2 unicode strings, you should reach
> out to them and say this, in a rather more clear fashion. Currently,
> most signs point towards UCS4 builds as being the better option.

UCS4 is the better option if you use lots of non-BMP code points
and if you have to regularly interface with C APIs using wchar_t
on Unix.

> Or, one might reasonably wonder why UCS-4 is an option at all, if nobody
> should enable it.

See above: there are use cases where this does make a lot of sense.

E.g. non-BMP code points can only be represented using surrogates on
UCS2 builds and these can be tricky to deal with (or at least
many people feel like it's tricky to deal with them ;-).
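
For illustration, this is the standard UTF-16 arithmetic that turns one
non-BMP code point into the two surrogates a UCS2 build has to store.
The helper name to_surrogate_pair is made up for this sketch; it is not
an existing Python API.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))
```

Code that indexes or slices such a string on a UCS2 build sees two code
units rather than one character, which is where the trickiness comes from.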

>> People building their own Python version will usually also build
>> their own extensions, so I don't really believe that the above
>> scenario is very common.
> 
> I'd just like to note that I've run into this trap multiple times. I
> built a custom python, and expected it to work with all the existing,
> installed, extensions (same major version as the system install, just
> patched). And then had to build it again with UCS4, for it to actually
> work. Of course building twice isn't the end of the world, and I'm
> certainly used to having to twiddle build options on software to get it
> working, but, this *does* happen, and *is* a tiny bit irritating.

Which is why I think that Python should include some more information
on the type of build being used, e.g. by placing the information
prominently on the startup line.
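
Until something like that exists, the build type can be checked at
runtime. A minimal sketch, relying on sys.maxunicode being 0xFFFF on a
UCS2 build and 0x10FFFF on a UCS4 build; the function name
unicode_build is made up for this example.

```python
import sys

def unicode_build():
    """Return 'UCS2' for a narrow build and 'UCS4' for a wide build."""
    # sys.maxunicode is the largest code point a single string
    # element can hold in this interpreter.
    return "UCS4" if sys.maxunicode > 0xFFFF else "UCS2"

print("Unicode build:", unicode_build())
```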

I still don't believe the above use case is a common one, though.

That said, Zooko's original motivation for the proposed change
is making installation of extensions easier for users. That's
a tools question much more than a Python Unicode one.

Aside: --enable-unicode is gone in Python 3.x. You now only
have the choice to use the default (which is UCS2) or switch on
the optional support for UCS4 by using --with-wide-unicode.

-- 
Marc-Andre Lemburg
eGenix.com
