[Python-Dev] please consider changing --enable-unicode default to ucs4

Thu Oct 8 02:10:25 CEST 2009

On Sun, Sep 20, 2009 at 10:17, Zooko O'Whielacronx <zookog at gmail.com> wrote:
> On Sun, Sep 20, 2009 at 8:27 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
>> AFAIK, C extensions should fail loading when they have the wrong UCS2/4 setting.
>
> That would be an improvement!  Unfortunately we instead get mysterious
> misbehavior of the module, e.g.:
>
> http://bugs.python.org/setuptools/msg309
> http://allmydata.org/trac/tahoe/ticket/704#comment:5

The real issue here is the confusion caused by Python's option being
misnamed.  We support UTF-16 and UTF-32, not UCS-2 and UCS-4.  This
means that when decoding UTF-8, any scalar value outside the BMP will
be split into a pair of surrogates on UTF-16 builds; if we were using
UCS-2 that'd be an error instead (and *nothing* would understand
surrogates).
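
To make the surrogate split concrete, here is a minimal sketch.  It
assumes Python 3.3 or later (where ord() sees one code point per
character regardless of build), and to_surrogates() is a hypothetical
helper written for illustration, not something in the interpreter.
U+1D11E is just an arbitrary scalar value outside the BMP.

```python
import struct


def to_surrogates(cp):
    """Split a scalar value above the BMP into a UTF-16 surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane scalar value")
    offset = cp - 0x10000             # 20 bits remain after the offset
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low surrogate
    return high, low


# 4-byte UTF-8 sequence for U+1D11E, decoded to a one-character string.
text = b"\xf0\x9d\x84\x9e".decode("utf-8")
cp = ord(text)

# The UTF-16 codec emits two 16-bit code units for this one character...
units = struct.unpack(">2H", text.encode("utf-16-be"))

# ...and they match the pair computed by hand.
assert units == to_surrogates(cp)
print([hex(u) for u in units])
```

A strict UCS-2 implementation would have no such pair to fall back on
and would have to reject the character outright.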

Yet we are getting an error here.  However, if you look at the details
you'll notice it's on a 6-byte UTF-8 code unit sequence, corresponding
in the second link to U+6E657770.  Although UTF-8 originally left
open the possibility of encoding up to 31 bits (up to U+7FFFFFFF),
this was removed in RFC 3629 and is now strictly prohibited.  The
modern Unicode character set itself also imposes that restriction:
there is nothing beyond U+10FFFF.  Nothing should create such a high
code point, and even if it happened internally, an RFC 3629-conformant
UTF-8 encoder must refuse to pass it through.
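
As a rough illustration of what such a sequence looks like, the sketch
below builds the obsolete 6-byte form for U+6E657770 by hand and checks
that a conformant decoder refuses it.  It assumes Python 3;
legacy_utf8_encode6(), decoder_rejects() and chr_rejects() are
hypothetical helpers of mine, not anything from the modules in the
linked tickets.

```python
def legacy_utf8_encode6(cp):
    """Encode cp using the obsolete 6-byte UTF-8 form.

    Hypothetical helper: layout is 1111110x followed by five
    10xxxxxx continuation bytes, 31 payload bits in total.
    """
    if not 0x4000000 <= cp <= 0x7FFFFFFF:
        raise ValueError("6-byte form covers U+4000000..U+7FFFFFFF only")
    lead = 0xFC | (cp >> 30)
    tail = [0x80 | ((cp >> shift) & 0x3F) for shift in (24, 18, 12, 6, 0)]
    return bytes([lead] + tail)


def decoder_rejects(data):
    """Return True if the built-in UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def chr_rejects(cp):
    """Return True if chr() refuses to build a string for cp."""
    try:
        chr(cp)
    except (ValueError, OverflowError):
        return True
    return False


raw = legacy_utf8_encode6(0x6E657770)
print(raw.hex())

# A conformant decoder must not accept the 6-byte form...
assert decoder_rejects(raw)
# ...and the value is not a valid code point in the first place.
assert chr_rejects(0x6E657770)
```

So a conformant codec should fail on this input in both directions,
which is consistent with the error being raised rather than with the
value having been produced legitimately.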

Something more subtle must be going on.  Possibly several bugs (such
as a non-conformant encoder or garbage being misinterpreted as UTF-8).

-- 
Adam Olsen, aka Rhamphoryncus