[Python-Dev] PEP 393 Summer of Code Project

Fri Aug 26 11:14:13 CEST 2011

On Fri, Aug 26, 2011 at 5:59 AM, Guido van Rossum <guido at python.org> wrote:

> On Thu, Aug 25, 2011 at 7:28 PM, Isaac Morland <ijmorlan at uwaterloo.ca>
> wrote:
> > On Thu, 25 Aug 2011, Guido van Rossum wrote:
> >
> >> I'm not sure what should happen with UTF-8 when it (in flagrant
> >> violation of the standard, I presume) contains two separately-encoded
> >> surrogates forming a valid surrogate pair; probably whatever the UTF-8
> >> codec does on a wide build today should be good enough.
>

Surrogates are used and valid only in UTF-16.
In UTF-8/32 they are invalid, even if they are in pair (see
http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf ).  Of course Python
can/should be able to represent them internally regardless of the build
type.

>>Similarly for
> >> encoding to UTF-8 on a wide build if one managed to create a string
> >> containing a surrogate pair. Basically, I'm for a
> >> garbage-in-garbage-out approach (with separate library functions to
> >> detect garbage if the app is worried about it).
> >
> > If it's called UTF-8, there is no decision to be taken as to decoder
> > behaviour - any byte sequence not permitted by the Unicode standard must
> > result in an error (although, of course, *how* the error is to be
> reported
> > could legitimately be the subject of endless discussion).
>

What do you mean?  We use the "strict" error handler by default and we can
specify other handlers already.

>  There are
> > security implications to violating the standard so this isn't just
> > legalistic purity.
>
> You have a point. The security issues cannot be seen separate from all
> the other issues. The folks inside Google who care about Unicode often
> harp on this. So I stand corrected. I am fine with codecs treating
> code points or code point sequences that the Unicode standard doesn't
> like (e.g. lone surrogates) the same way as more severe errors in the
> encoded bytes (lots of byte sequences already aren't valid UTF-8).

Codecs that use the official names should stick to the standards.  For
example s.encode('utf-32') should either produce a valid utf-32 byte string
or raise an error if 's' contains invalid characters (e.g. surrogates).
We can have other internal codecs that are based on the UTF-* encodings but
allow the representation of lone surrogates and even expose them if we want,
but they should have a different name (even 'utf-*-something' should be ok,
see http://bugs.python.org/issue12729#msg142053 from "Unicode says you can't
put surrogates or noncharacters in a UTF-anything stream.").

> I
> just hope this doesn't require normal forms or other expensive
> operations; I hope it's limited to rejecting invalid use of surrogates
> or other values that are not valid code points (e.g. 0, or >= 2**21).
>

I think there shouldn't be any normalization done automatically by the
codecs.

>
> > Hmmm, doesn't look good:
> >
> > Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
> > [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> > Type "help", "copyright", "credits" or "license" for more information.
> >>>>
> >>>> '\xed\xb0\x80'.decode ('utf-8')
> >
> > u'\udc00'
> >>>>
> >
> > Incorrect!  Although this is a narrow build - I can't say what the wide
> > build would do.
>

The UTF-8 codec used to follow RFC 2279 and only recently has been updated
to RFC 3629 (see http://bugs.python.org/issue8271#msg107074 ).  On Python
2.x it still produces invalid UTF-8 because changing it is backward
incompatible.  In Python 2 UTF-8 can be used to encode every codepoint from
0 to 10FFFF, and it always works.  If we change it now it might start
raising errors for an operation that never raised them before (see
http://bugs.python.org/issue12729#msg142047 ).
Luckily this is fixed in Python 3.x.
I think there are more codepoints/byte sequences that should be rejected
while encoding/decoding though, in both UTF-8 and UTF-16/32, but I haven't
looked at them yet (I would be happy to fix these for 3.3 or even 2.7/3.2
(if applicable), so if you find mismatches with the Unicode standard and
report an issue, feel free to assign it to me).

Best Regards,
Ezio Melotti

>
> > For reasons of practicality, it may be appropriate to provide easy access
> to
> > a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must not
> be
> > called UTF-8.  Other variations may also find use if provided.
> >
> > See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt
> >
> > And CESU-8 technical report: http://www.unicode.org/reports/tr26/
>
> Thanks for the links! I also like the term "supplemental character" (a
> code point >= 2**16). And I note that they talk about characters were
> we've just agreed that we should say code points...
>
> --
> --Guido van Rossum (python.org/~guido <http://python.org/%7Eguido>)
> _______________________________________________
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20110826/928797bf/attachment.html>