please consider changing --enable-unicode default to ucs4

Dear Pythonistas:

This issue causes serious problems. Users occasionally get binaries built for a compatible Linux and Python version but with a different UCS2-vs-UCS4 setting, and those users get mysterious memory corruption errors which are hard to diagnose. It is possible that these situations also open up security vulnerabilities. A couple such instances are documented on http://bugs.python.org/setuptools/issue78, but you can find more by googling. I would like to get this problem fixed!

In order to help address this issue I sampled what UCS size is used by python executables in the wild. I instrumented a few buildslaves that are contributed by various people to the Tahoe-LAFS project to print out their platform, python version, and sys.maxunicode. The full results are appended below. maxunicode: 1114111 means that python executable was configured with --enable-unicode=ucs4, and maxunicode: 65535 means that python executable was configured with --enable-unicode=ucs2 or just with --enable-unicode.

The only incompatibilities that I found are because some packagers have deliberately set UCS4 configuration and other packagers have left the default setting. In the three cases where someone configured python with UCS2, one of the three is certainly an accident (a custom-built python executable on an Ubuntu server) and the other two just use the default instead of specifically configuring ucs2 in their configurations of Python and I suspect that they don't know the difference and that it was an accident that they built a Python incompatible with other distributions of their operating system.

In sum, while it would be good to add the unicode setting to the platform's ABI (as discussed in setuptools ticket #78), it would also be good to make the default value be UCS4 instead of UCS2. This would fix all three of the potential incompatibilities that I found (listed below), and once we have proper inclusion of the unicode setting in the ABI in order to prevent the memory corruption, defaulting to UCS4 would increase the likelihood that a binary built on one distribution would be usable on another.

I'm sure that someone can come up with a reason why UCS2 is better than UCS4, but I'm also sure that the benefits of compatibility outweigh any benefits of UCS2 encoding, and that the widespread use of UCS4 demonstrates that there is nothing fatally wrong with it, and that people who really value UCS2 encoding more than compatibility can choose that for themselves by explicitly setting UCS2.

Let me restate that I am not suggesting taking away anyone's options, only making the setting for people who don't specify default to the compatible option. Hm, I guess that means that it should default to UCS2 on Windows and Mac and to UCS4 on Linux and Solaris.

Regards,
Zooko

Ubuntu 6.10 "edgy" i386: python: 2.4.4c1 (#2, Mar 7 2008, 03:03:38) [GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)], maxunicode: 1114111
Ubuntu 7.04 "feisty": python: 2.5.1 (r251:54863, Jul 31 2008, 22:53:39) [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)], maxunicode: 1114111
Ubuntu 7.10 "gutsy" i386: python: 2.5.1 (r251:54863, Jul 31 2008, 23:17:40) [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)], maxunicode: 1114111
Ubuntu 8.04 "hardy" amd64: python: 2.5.2 (r252:60911, Jul 22 2009, 15:33:10) [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)], maxunicode: 1114111
Ubuntu 8.04 "hardy" i386: *custom* python: 2.6 (r26:66714, Oct 2 2008, 13:40:28) [GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)], maxunicode: 65535
Ubuntu 8.04 "hardy" i386: python: 2.5.2 (r252:60911, Jul 22 2009, 15:35:03) [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)], maxunicode: 1114111
Ubuntu 9.04 "jaunty" amd64: *custom* python: 2.6.2 (release26-maint, Apr 19 2009, 01:58:18) [GCC 4.3.3], maxunicode: 1114111
Debian 4.0 "etch" i386: python: 2.4.4 (#2, Oct 22 2008, 19:52:44) [GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)], maxunicode: 1114111
Debian 5.0 "lenny" i386: python: 2.5.2 (r252:60911, Jan 4 2009, 17:40:26) [GCC 4.3.2], maxunicode: 1114111
Debian 5.0 "lenny" amd64: python: 2.5.2 (r252:60911, Jan 4 2009, 21:59:32) [GCC 4.3.2], maxunicode: 1114111
Debian 5.0 "lenny" armv5tel: python: 2.5.2 (r252:60911, Jan 5 2009, 02:00:00) [GCC 4.3.2], maxunicode: 1114111
Debian unstable "squeeze/sid" i386: python: 2.5.4 (r254:67916, Feb 17 2009, 20:16:45) [GCC 4.3.3], maxunicode: 1114111
Fedora 11 "leonidas" amd64: python: 2.6 (r26:66714, Jul 4 2009, 17:37:13) [GCC 4.4.0 20090506 (Red Hat 4.4.0-4)], maxunicode: 1114111
ArchLinux: python: 2.6.2 (r262:71600, Jul 20 2009, 02:23:30) [GCC 4.4.0 20090630 (prerelease)], maxunicode: 65535
NetBSD 4: python: 2.5.2 (r252:60911, Mar 20 2009, 14:00:07) [GCC 4.1.2 20060628 prerelease (NetBSD nb2 20060711)], maxunicode: 65535
OpenSolaris SunOS-5.11-i86pc-i386-32bit: python: 2.4.4 (#1, Mar 10 2009, 09:35:36) [C], maxunicode: 65535
Nexenta NCP1 SunOS-5.11-i86pc-i386-32bit: python: 2.4.3 (#2, May 3 2006, 19:12:42) [GCC 4.0.3 (GNU_OpenSolaris 4.0.3-1nexenta4)], maxunicode: 1114111
Mac OS 10.6 "snow leopard" i386: python: 2.6.1 (r261:67515, Jul 7 2009, 23:51:51) [GCC 4.2.1 (Apple Inc. build 5646)], maxunicode: 65535
Mac OS 10.5 "leopard" i386: python: 2.5.1 (r251:54863, Feb 6 2009, 19:02:12) [GCC 4.0.1 (Apple Inc. build 5465)], maxunicode: 65535
Mac OS 10.4 "tiger" *custom* python: 2.5.4 (release25-maint:72153M, Apr 30 2009, 12:28:20) [GCC 4.0.1 (Apple Computer, Inc. build 5367)], maxunicode: 65535
Cygwin CYGWIN_NT-5.1-1.5.25-0.156-4-2-i686-32bit-WindowsPE: python: 2.5.2 (r252:60911, Dec 2 2008, 09:26:14) [GCC 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)], maxunicode: 65535
Windows: python: 2.6.2 (r262:71600, Apr 21 2009, 15:05:37) [MSC v.1500 32 bit (Intel)], maxunicode: 65535
Windows: python: 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)], maxunicode: 65535
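
For readers who want to repeat the sampling described in this message, the check might look roughly like the sketch below. It is not the script that ran on the buildslaves; the helper names unicode_build and report_line are made up for illustration, and only sys.maxunicode, sys.version and platform.platform() are real standard-library names. Note that on Python 3.3 and later sys.maxunicode is always 1114111, so the distinction only shows up on the older interpreters discussed in this thread.

```python
import platform
import sys

UCS2_MAX = 0xFFFF      # 65535: --enable-unicode=ucs2, or the plain default
UCS4_MAX = 0x10FFFF    # 1114111: --enable-unicode=ucs4


def unicode_build(maxunicode):
    """Map a sys.maxunicode value to the configure setting it implies."""
    if maxunicode == UCS4_MAX:
        return "ucs4"
    if maxunicode == UCS2_MAX:
        return "ucs2"
    raise ValueError("unexpected sys.maxunicode: %r" % (maxunicode,))


def report_line():
    """Format one line in roughly the style of the survey results."""
    version = sys.version.replace("\n", " ")
    return "%s: python: %s, maxunicode: %d" % (
        platform.platform(), version, sys.maxunicode)


if __name__ == "__main__":
    print(report_line())
    print("build: " + unicode_build(sys.maxunicode))
```

Running it on each machine and collecting the output would give one line per interpreter, comparable to the entries listed above.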

2009/9/20 Zooko O'Whielacronx <zookog@gmail.com>:
You may want to have a look at the archives of the last time this was extensively discussed: http://mail.python.org/pipermail/python-dev/2008-July/080886.html -- Regards, Benjamin

I'm sorry, I should have mentioned that I did read those archives before I posted my letter. That discussion was all about whether UCS2 or UCS4 is better. I consider that question to be mostly irrelevant to this issue, which is about compatibility for people who don't choose to configure that setting themselves. Platforms or people who prefer UCS2 will continue to use it as appropriate. UCS4 is clearly good enough for the vast majority of Linux users, and having fewer mysterious segfaults and potential security vulnerabilities would be an important improvement to the user experience of Python on Linux. I should mention that the reason I'm spending time on this right now is that it is currently blocking me from being able to distribute binaries of Python packages which will work for all of my Linux users. Regards, Zooko

Zooko O'Whielacronx wrote:
You surely must have missed the sentence "For that reason I think it's also better that the configure script continues to default to UTF-16 -- this will give the UTF-16 support code the necessary exercise." This is effectively a BDFL pronouncement. Nothing has changed the validity of the premise of the statement, so the conclusion remains valid, as well. Regards, Martin

Zooko O'Whielacronx <zookog <at> gmail.com> writes:
What "binaries" are you talking about? AFAIK, C extensions should fail loading when they have the wrong UCS2/4 setting. That's the reason we have all those #define's in unicodeobject.h: the actual function names end up being different and, therefore, are not found when linking.
> In order to help address this issue I sampled what UCS size is used by python executables in the wild.
For information, all Mandriva versions I've used until now have had their Python's built with UCS2 (maxunicode == 65535). Regards Antoine.
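
The name mangling mentioned in this message can be modelled in a few lines. This is only a sketch of the idea, written in Python rather than as the C preprocessor #defines in unicodeobject.h: mangled_symbol and unresolved_symbols are hypothetical helpers, while the PyUnicodeUCS2_ and PyUnicodeUCS4_ prefixes are the ones the Python 2 headers use.

```python
PREFIX = "PyUnicode_"


def mangled_symbol(api_name, ucs_width):
    """Return the linker-level name for a Unicode API function.

    ucs_width is 2 or 4, matching the interpreter's build setting.
    """
    if ucs_width not in (2, 4):
        raise ValueError("ucs_width must be 2 or 4")
    if not api_name.startswith(PREFIX):
        raise ValueError("not a PyUnicode_ API name: %r" % (api_name,))
    return "PyUnicodeUCS%d_%s" % (ucs_width, api_name[len(PREFIX):])


def unresolved_symbols(extension_apis, extension_width,
                       interpreter_apis, interpreter_width):
    """List the symbols an extension needs that the interpreter lacks."""
    wanted = {mangled_symbol(name, extension_width)
              for name in extension_apis}
    exported = {mangled_symbol(name, interpreter_width)
                for name in interpreter_apis}
    return sorted(wanted - exported)
```

For example, an extension compiled against a UCS2 build that calls PyUnicode_DecodeUTF8 would look for PyUnicodeUCS2_DecodeUTF8, which a UCS4 interpreter does not export, so loading fails at link time. An extension that references no such name has nothing to fail on, which is why the check only covers modules that actually use the Unicode API.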

On Sun, Sep 20, 2009 at 8:27 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
> What "binaries" are you talking about?
I mean extension modules with native code, which means .so shared library files on unix.
> AFAIK, C extensions should fail loading when they have the wrong UCS2/4 setting.
That would be an improvement! Unfortunately we instead get mysterious misbehavior of the module, e.g.: http://bugs.python.org/setuptools/msg309 http://allmydata.org/trac/tahoe/ticket/704#comment:5
> For information, all Mandriva versions I've used until now have had their Python's built with UCS2 (maxunicode == 65535).
Thank you for the data point. This means that binary extension modules built on Mandriva can't be ported to Ubuntu or vice versa. However, is this an argument for or against changing the default setting to UCS4? Changing the default setting wouldn't interfere with Mandriva's decision, right? Regards, Zooko

Le Sun, 20 Sep 2009 10:17:45 -0600, Zooko O'Whielacronx a écrit :
The bug reports in themselves aren't very explicit, and they don't seem to be related to any native extension. So I'm not sure why you're talking about "mysterious memory corruption errors" in your original mail, because there doesn't seem to be such a thing happening at all. Please note that there's a bug related to a non-portable peephole optimization of some unicode constants, perhaps it may explain the aforementioned problems (perhaps not): http://bugs.python.org/issue5057
I expect the solution to this bug to be rather easy (just disable the optimization, since it isn't really useful), but someone has to care enough to produce a patch.
"Ported" they can certainly be, you just have to recompile.
Well, let's put it this way:
- either you expect the default setting to be observed by everyone, and it *will* interfere with someone's current decision
- or you don't expect the default setting to be observed by everyone, and then there's no point in changing it because it won't stop your problems
Either way, my mentioning of Mandriva was just meant as an additional data point to those you already provided ;-) Regards Antoine.

Zooko O'Whielacronx wrote:
Those will not load unless they are for the right UCS-version of Python. The extensions will give an ImportError if they are using any Unicode APIs - we go to great lengths in the Unicode API to make sure that you cannot mix UCS2 and UCS4 APIs. I'm not exactly sure what you are trying to achieve by making UCS4 the default... if you build extensions using the system Python version, distutils will automatically build the right UCS-version for you.
Those don't appear to be related to UCS2 vs. UCS4 but rather some problem with the UTF-8 data those users are trying to load. The fact that setuptools completely ignores the fact that Python UCS2 and UCS4 are two different Python builds, is not really a Python Unicode problem, but one of the setuptools design, so you should probably complain there.
Depends on what you mean with "ported": of course you can port a source RPM between UCS2 and UCS4 builds. This just requires a recompile. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Sep 20 2009)

On Sun, Sep 20, 2009 at 10:17, Zooko O'Whielacronx <zookog@gmail.com> wrote:
The real issue here is getting confused because python's option is misnamed. We support UTF-16 and UTF-32, not UCS-2 and UCS-4. This means that when decoding UTF-8, any scalar value outside the BMP will be split into a pair of surrogates on UTF-16 builds; if we were using UCS-2 that'd be an error instead (and *nothing* would understand surrogates.) Yet we are getting an error here. However, if you look at the details you'll notice it's on a 6-byte UTF-8 code unit sequence, corresponding in the second link to U+6E657770. Although UTF-8 originally left open the possibility of including up to 31 bits (or U+7FFFFFFF), this was removed in RFC 3629 and is now strictly prohibited. The modern Unicode character set itself also imposes that restriction. There is nothing beyond U+10FFFF. Nothing should create such a high code point, and even if it happened internally an RFC 3629-conformant UTF-8 encoder must refuse to pass it through. Something more subtle must be going on. Possibly several bugs (such as a non-conformant encoder or garbage being misinterpreted as UTF-8). -- Adam Olsen, aka Rhamphoryncus
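
The two mechanisms described in this message can be worked through by hand. The sketch below is illustrative only: to_surrogate_pair and decode_legacy_six_byte are hypothetical helpers, not CPython functions, and the byte string is simply the pre-RFC 3629 6-byte encoding of the value U+6E657770 quoted above, computed for this example rather than copied from the bug report.

```python
MAX_CODE_POINT = 0x10FFFF  # upper limit of Unicode, per RFC 3629


def to_surrogate_pair(code_point):
    """Split a code point outside the BMP into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= MAX_CODE_POINT:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def decode_legacy_six_byte(data):
    """Decode one 6-byte sequence of the original, 31-bit UTF-8 form.

    Modern decoders must reject this form; it is reproduced here only
    to show where a value such as 0x6E657770 can come from.
    """
    if len(data) != 6 or (data[0] & 0xFE) != 0xFC:
        raise ValueError("not a 6-byte lead byte (1111110x)")
    value = data[0] & 0x01
    for byte in data[1:]:
        if (byte & 0xC0) != 0x80:
            raise ValueError("bad continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


# Legacy 6-byte encoding of 0x6E657770 (worked out by hand).
LEGACY_BYTES = b"\xfd\xae\x99\x97\x9d\xb0"

if __name__ == "__main__":
    print([hex(unit) for unit in to_surrogate_pair(0x1D11E)])
    print(hex(decode_legacy_six_byte(LEGACY_BYTES)))
```

The decoded value is far above U+10FFFF, so it cannot be represented as a surrogate pair either, and a conformant decoder such as Python 3's bytes.decode("utf-8") raises UnicodeDecodeError on these bytes instead of producing a character.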

Adam Olsen wrote:
I agree that a better error message would help. I'm just not sure how to achieve that. The error message you currently see gets generated by the dynamic linker trying to resolve a Python Unicode API symbol: the API names are mangled to ensure that you cannot mix UCS2 interpreters and UCS4 extensions (and vice-versa). We could try to scan the linker error message for 'Py.*UCS.' and then replace the message with a more helpful one (in importdl.c), but I'm not sure how portable that is. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 08 2009)
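
The message-scanning idea floated here could be prototyped as below. This is a Python illustration of the approach, not a patch to importdl.c; explain_linker_error is a hypothetical helper, the regular expression is a stricter variant of the 'Py.*UCS.' pattern suggested above, and the sample message used in the note after the code is made up, since the exact wording depends on the platform's dynamic linker (the portability concern raised in the message).

```python
import re

# Matches a mangled Unicode API symbol and captures its UCS width.
UCS_SYMBOL = re.compile(r"\bPyUnicodeUCS([24])_\w+")


def explain_linker_error(message, interpreter_width):
    """Append a hint when a linker error names a UCS-mangled symbol.

    The unresolved symbol is the one the extension asked for, so its
    width tells us which kind of Python the extension was built for.
    Messages that do not match are returned unchanged.
    """
    match = UCS_SYMBOL.search(message)
    if match is None:
        return message
    extension_width = int(match.group(1))
    if extension_width == interpreter_width:
        return message  # some other problem; do not guess
    hint = ("this extension was built for a UCS%d Python, but the "
            "running interpreter is a UCS%d build; rebuild the "
            "extension against this interpreter"
            % (extension_width, interpreter_width))
    return "%s (%s)" % (message, hint)
```

Given an illustrative message such as "/tmp/example.so: undefined symbol: PyUnicodeUCS2_DecodeUTF8" on a UCS4 interpreter, the helper would return the same text followed by the hint in parentheses.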

On Sun, Sep 20, 2009 at 8:27 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
> For information, all Mandriva versions I've used until now have had their Python's built with UCS2 (maxunicode == 65535).
By the way, I was investigating this, and discovered an issue on the Mandriva tracker which suggests that they intend to switch to UCS4 in the next release in order to avoid compatibility problems like these. (Not because they think that UCS4 is better than UCS2.) https://qa.mandriva.com/show_bug.cgi?id=48570 Regards, Zooko

Le Sun, 20 Sep 2009 10:33:23 -0600, Zooko O'Whielacronx a écrit :
Trying to use a Fedora or Suse RPM under Mandriva (or the other way round) isn't reasonable and is certainly not supported. I don't understand why this so-called "compatibility problem" should be taken seriously by anyone. Regards Antoine.

Dne 20.9.2009 18:42, Antoine Pitrou napsal(a):
You're not making sense. No distro is an island - plus, upstream distributors have a nasty habit of providing RPMs only for Fedora. I don't see what is bad about improving compatibility in a place where the setting doesn't hurt one way or the other. Besides, the more compatibility we achieve now, the easier time we'll have once python makes it into LSB. regards m.

Le lundi 05 octobre 2009 à 19:18 +0200, Jan Matejek a écrit :
I can't speak for Mandriva, but I'm sure they care more about not breaking user installs when they upgrade to Mandriva X + 1, than about making it possible to use Fedora RPMs on Mandriva. In any case, this is quite off-topic for python-dev. If you are motivated about this, you should try to convince the Mandriva developers instead. Regards Antoine.

participants (7):
- "Martin v. Löwis"
- Adam Olsen
- Antoine Pitrou
- Benjamin Peterson
- Jan Matejek
- M.-A. Lemburg
- Zooko O'Whielacronx