[Python-Dev] PEP 393 Summer of Code Project

Fri Aug 26 04:59:10 CEST 2011

On Thu, Aug 25, 2011 at 7:28 PM, Isaac Morland <ijmorlan at uwaterloo.ca> wrote:
> On Thu, 25 Aug 2011, Guido van Rossum wrote:
>
>> I'm not sure what should happen with UTF-8 when it (in flagrant
>> violation of the standard, I presume) contains two separately-encoded
>> surrogates forming a valid surrogate pair; probably whatever the UTF-8
>> codec does on a wide build today should be good enough. Similarly for
>> encoding to UTF-8 on a wide build if one managed to create a string
>> containing a surrogate pair. Basically, I'm for a
>> garbage-in-garbage-out approach (with separate library functions to
>> detect garbage if the app is worried about it).
>
> If it's called UTF-8, there is no decision to be taken as to decoder
> behaviour - any byte sequence not permitted by the Unicode standard must
> result in an error (although, of course, *how* the error is to be reported
> could legitimately be the subject of endless discussion).  There are
> security implications to violating the standard so this isn't just
> legalistic purity.

You have a point. The security issues cannot be seen separate from all
the other issues. The folks inside Google who care about Unicode often
harp on this. So I stand corrected. I am fine with codecs treating
code points or code point sequences that the Unicode standard doesn't
like (e.g. lone surrogates) the same way as more severe errors in the
encoded bytes (lots of byte sequences already aren't valid UTF-8). I
just hope this doesn't require normal forms or other expensive
operations; I hope it's limited to rejecting invalid use of surrogates
or other values that are not valid code points (e.g. 0, or >= 2**21).

> Hmmm, doesn't look good:
>
> Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>>>>
>>>> '\xed\xb0\x80'.decode ('utf-8')
>
> u'\udc00'
>>>>
>
> Incorrect!  Although this is a narrow build - I can't say what the wide
> build would do.
>
> For reasons of practicality, it may be appropriate to provide easy access to
> a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not be
> called UTF-8.  Other variations may also find use if provided.
>
> See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt
>
> And CESU-8 technical report: http://www.unicode.org/reports/tr26/

Thanks for the links! I also like the term "supplemental character" (a
code point >= 2**16). And I note that they talk about characters were
we've just agreed that we should say code points...

-- 
--Guido van Rossum (python.org/~guido)