On Sun, Sep 20, 2009 at 10:17, Zooko O'Whielacronx wrote:
On Sun, Sep 20, 2009 at 8:27 AM, Antoine Pitrou wrote:
AFAIK, C extensions should fail loading when they have the wrong UCS2/4 setting.
That would be an improvement! Unfortunately we instead get mysterious misbehavior of the module, e.g.:
http://bugs.python.org/setuptools/msg309
http://allmydata.org/trac/tahoe/ticket/704#comment:5
The real issue here is confusion caused by Python's misnamed build option. We support UTF-16 and UTF-32, not UCS-2 and UCS-4. This means that when decoding UTF-8, any scalar value outside the BMP is split into a pair of surrogates on UTF-16 builds; if we were using UCS-2 that would be an error instead (and *nothing* would understand surrogates).

Yet we are getting an error here. However, if you look at the details you'll notice it's on a 6-byte UTF-8 code unit sequence, corresponding in the second link to U+6E657770. Although UTF-8 originally left open the possibility of encoding up to 31 bits (up to U+7FFFFFFF), this was removed in RFC 3629 and is now strictly prohibited. The modern Unicode character set itself imposes the same restriction: there is nothing beyond U+10FFFF. Nothing should create such a high code point, and even if one appeared internally, an RFC 3629-conformant UTF-8 encoder must refuse to pass it through.

Something more subtle must be going on. Possibly several bugs (such as a non-conformant encoder or garbage being misinterpreted as UTF-8).

--
Adam Olsen, aka Rhamphoryncus
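
To make the points above concrete, here is a minimal sketch. It assumes Python 3 (not the 2.x builds under discussion), and the helper names legacy_utf8_6byte, utf16_units and is_valid_utf8 are made up for illustration; none of them is an existing API. It builds the obsolete 6-byte form by hand, since a conformant encoder will not produce it, and checks that a strict decoder rejects it, while a non-BMP scalar value becomes a surrogate pair in UTF-16.

```python
def legacy_utf8_6byte(cp):
    """Hand-build the obsolete 6-byte UTF-8 form (pre-RFC 3629), for illustration only."""
    if not 0x4000000 <= cp <= 0x7FFFFFFF:
        raise ValueError("the 6-byte form only covers 0x4000000..0x7FFFFFFF")
    lead = 0xFC | (cp >> 30)  # lead byte pattern 1111110x
    tail = [0x80 | ((cp >> shift) & 0x3F) for shift in (24, 18, 12, 6, 0)]
    return bytes([lead] + tail)


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_valid_utf8(data):
    """True if a strict (RFC 3629 style) decoder accepts data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The value from the second link, in the old 6-byte form: a strict decoder refuses it.
bogus = legacy_utf8_6byte(0x6E657770)
print(len(bogus), is_valid_utf8(bogus))

# A scalar value just outside the BMP is a surrogate pair in UTF-16.
print([hex(unit) for unit in utf16_units("\U00010000")])

# U+10FFFF is the last code point; it still round-trips through UTF-8.
print(is_valid_utf8("\U0010FFFF".encode("utf-8")))
```

This only shows what a conformant codec does with such input. It says nothing about where the bad bytes in the linked reports actually came from.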