[Python-Dev] please consider changing --enable-unicode default to ucs4

Zooko O'Whielacronx zookog at gmail.com
Sun Sep 20 16:02:09 CEST 2009


Dear Pythonistas:

This issue causes serious problems.  Users occasionally get binaries built for a
compatible Linux and Python version but with a different UCS2-vs-UCS4 setting,
and those users get mysterious memory corruption errors which are hard to
diagnose.  It is possible that these situations also open up security
vulnerabilities.  A couple of such instances are documented on
http://bugs.python.org/setuptools/issue78, but you can find more by googling.
I would like to get this problem fixed!

To help address this issue I sampled which UCS size is used by python
executables in the wild.  I instrumented a few buildslaves that are contributed
by various people to the Tahoe-LAFS project to print out their platform, python
version, and sys.maxunicode.  The full results are appended below.
maxunicode: 1114111 means that the python executable was configured with
--enable-unicode=ucs4, and maxunicode: 65535 means that it was configured with
--enable-unicode=ucs2 or just with --enable-unicode.  The only incompatibilities
that I found arise because some packagers have deliberately set the UCS4
configuration and other packagers have left the default setting.
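
For anyone who wants to run the same kind of check on their own machine, here
is a minimal sketch of such a report.  It is not the exact instrumentation used
on the buildslaves; the function names classify_unicode_build and
describe_interpreter are made up for illustration.  Note that on Python 3.3 and
later sys.maxunicode is always 1114111, so the distinction only shows up on
older interpreters.

```python
import platform
import sys

UCS2_MAX = 0xFFFF      # 65535: narrow build (--enable-unicode=ucs2 or the default)
UCS4_MAX = 0x10FFFF    # 1114111: wide build (--enable-unicode=ucs4)


def classify_unicode_build(maxunicode):
    """Map a sys.maxunicode value to the configure setting it implies.

    Hypothetical helper for illustration; not part of the standard library.
    """
    if maxunicode == UCS4_MAX:
        return "ucs4"
    if maxunicode == UCS2_MAX:
        return "ucs2"
    return "unknown"


def describe_interpreter():
    """Return a one-line report: platform, python version, and sys.maxunicode."""
    version = " ".join(sys.version.split())  # collapse newlines in sys.version
    return "%s: python: %s, maxunicode: %d (%s)" % (
        platform.platform(),
        version,
        sys.maxunicode,
        classify_unicode_build(sys.maxunicode),
    )


if __name__ == "__main__":
    print(describe_interpreter())
```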

In the three cases where someone configured python with UCS2, one of the three
is certainly an accident (a custom-built python executable on an Ubuntu server).
The other two just use the default instead of specifically configuring ucs2 in
their configurations of Python.  I suspect that their packagers don't know the
difference and that it was an accident that they built a Python incompatible
with other distributions of their operating system.

In sum, while it would be good to add the unicode setting to the platform's ABI
(as discussed in setuptools ticket #78), it would also be good to make the
default value be UCS4 instead of UCS2.  This would fix all three of the potential
incompatibilities that I found (listed below), and once we have proper inclusion
of the unicode setting in the ABI in order to prevent the memory corruption,
defaulting to UCS4 would increase the likelihood that a binary built on one
distribution would be usable on another.
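
To illustrate what including the unicode setting in the ABI could look like,
here is a rough sketch under my own assumptions.  The tag format and the names
unicode_abi_suffix, abi_tag, and check_binary_compatible are hypothetical; this
is not how setuptools or Python currently behaves.  The idea is only that a
binary built for one UCS width gets refused up front instead of being loaded
and corrupting memory.

```python
UCS2_MAX = 0xFFFF      # sys.maxunicode on a UCS2 build
UCS4_MAX = 0x10FFFF    # sys.maxunicode on a UCS4 build


def unicode_abi_suffix(maxunicode):
    """Return a hypothetical ABI suffix for a given sys.maxunicode value."""
    if maxunicode == UCS4_MAX:
        return "ucs4"
    if maxunicode == UCS2_MAX:
        return "ucs2"
    raise ValueError("unexpected maxunicode: %r" % (maxunicode,))


def abi_tag(python_version, maxunicode):
    """Build a hypothetical ABI tag such as 'py2.6-ucs4'."""
    return "py%s-%s" % (python_version, unicode_abi_suffix(maxunicode))


def check_binary_compatible(binary_tag, interpreter_tag):
    """Refuse a binary whose ABI tag differs from the interpreter's tag."""
    if binary_tag != interpreter_tag:
        raise ImportError(
            "binary built for %s cannot be loaded by %s"
            % (binary_tag, interpreter_tag)
        )


if __name__ == "__main__":
    built_on = abi_tag("2.6", UCS4_MAX)   # e.g. a distribution's UCS4 python
    running = abi_tag("2.6", UCS2_MAX)    # e.g. a custom default-configured python
    try:
        check_binary_compatible(built_on, running)
    except ImportError as err:
        print("refused: %s" % err)
```

With a check like this, the custom UCS2 python on Ubuntu in the results below
would reject a binary built for the distribution's UCS4 python rather than
loading it.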

I'm sure that someone can come up with a reason why UCS2 is better than UCS4,
but I'm also sure that the benefits of compatibility outweigh any benefits of
UCS2 encoding, and that the widespread use of UCS4 demonstrates that there is
nothing fatally wrong with it, and that people who really value UCS2 encoding
more than compatibility can choose that for themselves by explicitly setting
UCS2.

Let me restate that I am not suggesting taking away anyone's options, only
making the default, for people who don't specify a setting, be the compatible
option.  Hm, I guess that means that it should default to UCS2 on Windows and
Mac and to UCS4 on Linux and Solaris.

Regards,

Zooko

Ubuntu 6.10 "edgy" i386: python: 2.4.4c1 (#2, Mar  7 2008, 03:03:38)  [GCC 4.1.2
20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)], maxunicode: 1114111
Ubuntu 7.04 "feisty": python: 2.5.1 (r251:54863, Jul 31 2008, 22:53:39)  [GCC
4.1.2 (Ubuntu 4.1.2-0ubuntu4)], maxunicode: 1114111
Ubuntu 7.10 "gutsy" i386: python: 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)], maxunicode: 1114111
Ubuntu 8.04 "hardy" amd64: python: 2.5.2 (r252:60911, Jul 22 2009, 15:33:10)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)], maxunicode: 1114111
Ubuntu 8.04 "hardy" i386: *custom* python: 2.6 (r26:66714, Oct  2 2008,
13:40:28)  [GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)], maxunicode: 65535
Ubuntu 8.04 "hardy" i386: python: 2.5.2 (r252:60911, Jul 22 2009, 15:35:03)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)], maxunicode: 1114111
Ubuntu 9.04 "jaunty" amd64: *custom* python: 2.6.2 (release26-maint, Apr 19
2009, 01:58:18)  [GCC 4.3.3], maxunicode: 1114111

Debian 4.0 "etch" i386: python: 2.4.4 (#2, Oct 22 2008, 19:52:44)  [GCC 4.1.2
20061115 (prerelease) (Debian 4.1.1-21)], maxunicode: 1114111
Debian 5.0 "lenny" i386: python: 2.5.2 (r252:60911, Jan  4 2009, 17:40:26)  [GCC
4.3.2], maxunicode: 1114111
Debian 5.0 "lenny" amd64: python: 2.5.2 (r252:60911, Jan  4 2009, 21:59:32)
[GCC 4.3.2], maxunicode: 1114111
Debian 5.0 "lenny" armv5tel: python: 2.5.2 (r252:60911, Jan  5 2009, 02:00:00)
[GCC 4.3.2], maxunicode: 1114111
Debian unstable "squeeze/sid" i386: python: 2.5.4 (r254:67916, Feb 17 2009,
20:16:45)  [GCC 4.3.3], maxunicode: 1114111

Fedora 11 "leonidas" amd64: python: 2.6 (r26:66714, Jul  4 2009, 17:37:13)  [GCC
4.4.0 20090506 (Red Hat 4.4.0-4)], maxunicode: 1114111

ArchLinux: python: 2.6.2 (r262:71600, Jul 20 2009, 02:23:30)  [GCC 4.4.0
20090630 (prerelease)], maxunicode: 65535

NetBSD 4: python: 2.5.2 (r252:60911, Mar 20 2009, 14:00:07)  [GCC 4.1.2 20060628
prerelease (NetBSD nb2 20060711)], maxunicode: 65535

OpenSolaris SunOS-5.11-i86pc-i386-32bit: python: 2.4.4 (#1, Mar 10 2009,
09:35:36) [C], maxunicode: 65535
Nexenta NCP1 SunOS-5.11-i86pc-i386-32bit: python: 2.4.3 (#2, May  3 2006,
19:12:42)  [GCC 4.0.3 (GNU_OpenSolaris 4.0.3-1nexenta4)], maxunicode: 1114111

Mac OS 10.6 "snow leopard" i386: python: 2.6.1 (r261:67515, Jul  7 2009,
23:51:51)  [GCC 4.2.1 (Apple Inc. build 5646)], maxunicode: 65535
Mac OS 10.5 "leopard" i386: python: 2.5.1 (r251:54863, Feb  6 2009, 19:02:12)
[GCC 4.0.1 (Apple Inc. build 5465)], maxunicode: 65535
Mac OS 10.4 "tiger" *custom* python: 2.5.4 (release25-maint:72153M, Apr 30 2009,
12:28:20)  [GCC 4.0.1 (Apple Computer, Inc. build 5367)], maxunicode: 65535

Cygwin CYGWIN_NT-5.1-1.5.25-0.156-4-2-i686-32bit-WindowsPE: python: 2.5.2
(r252:60911, Dec  2 2008, 09:26:14)  [GCC 3.4.4 (cygming special, gdc 0.12,
using dmd 0.125)], maxunicode: 65535

Windows: python: 2.6.2 (r262:71600, Apr 21 2009, 15:05:37) [MSC v.1500 32 bit
(Intel)], maxunicode: 65535
Windows: python: 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)], maxunicode: 65535

