[Python-Dev] PEP 393 Summer of Code Project
Stefan Behnel
stefan_ml at behnel.de
Fri Aug 26 06:35:26 CEST 2011
Isaac Morland, 26.08.2011 04:28:
> On Thu, 25 Aug 2011, Guido van Rossum wrote:
>> I'm not sure what should happen with UTF-8 when it (in flagrant
>> violation of the standard, I presume) contains two separately-encoded
>> surrogates forming a valid surrogate pair; probably whatever the UTF-8
>> codec does on a wide build today should be good enough. Similarly for
>> encoding to UTF-8 on a wide build if one managed to create a string
>> containing a surrogate pair. Basically, I'm for a
>> garbage-in-garbage-out approach (with separate library functions to
>> detect garbage if the app is worried about it).
>
> If it's called UTF-8, there is no decision to be taken as to decoder
> behaviour - any byte sequence not permitted by the Unicode standard must
> result in an error (although, of course, *how* the error is to be reported
> could legitimately be the subject of endless discussion). There are
> security implications to violating the standard so this isn't just
> legalistic purity.
>
> Hmmm, doesn't look good:
>
> Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> '\xed\xb0\x80'.decode ('utf-8')
> u'\udc00'
> >>>
>
> Incorrect! Although this is a narrow build - I can't say what the wide
> build would do.
Works the same for me in a wide Py2.7 build, but gives me this in Py3:
Python 3.1.2 (r312:79147, Sep 27 2010, 09:57:50)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> b'\xed\xb0\x80'.decode ('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: illegal encoding
Same for current Py3.3 and the PEP393 build (although both have a better
exception message now: "UnicodeDecodeError: 'utf8' codec can't decode bytes
in position 0-1: invalid continuation byte").
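For anyone who wants to reproduce the Py3 behaviour, here is a minimal sketch. It assumes Python 3.1 or later; the helper name try_decode is made up for illustration and is not part of any library. It contrasts the default strict decoding with the "surrogatepass" error handler, which is the explicit opt-in for letting an encoded surrogate through.

```python
# Sketch only: compares strict UTF-8 decoding with the 'surrogatepass'
# error handler. try_decode is a hypothetical helper, not a stdlib function.

data = b'\xed\xb0\x80'  # U+DC00 (a lone low surrogate) encoded UTF-8 style


def try_decode(raw, errors='strict'):
    """Return the decoded string, or the UnicodeDecodeError that was raised."""
    try:
        return raw.decode('utf-8', errors)
    except UnicodeDecodeError as exc:
        return exc


strict_result = try_decode(data)                    # default: strict
lenient_result = try_decode(data, 'surrogatepass')  # explicit opt-in

print(type(strict_result).__name__)  # the exception class raised by strict mode
print(ascii(lenient_result))         # the lone surrogate, shown escaped
```

With "surrogatepass" the round trip also works in the other direction, i.e. lenient_result.encode('utf-8', 'surrogatepass') gives back the original three bytes, while a plain .encode('utf-8') on that string raises UnicodeEncodeError.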
Stefan