[Python-Dev] PEP 393 Summer of Code Project

Stefan Behnel stefan_ml at behnel.de
Fri Aug 26 06:35:26 CEST 2011


Isaac Morland, 26.08.2011 04:28:
> On Thu, 25 Aug 2011, Guido van Rossum wrote:
>> I'm not sure what should happen with UTF-8 when it (in flagrant
>> violation of the standard, I presume) contains two separately-encoded
>> surrogates forming a valid surrogate pair; probably whatever the UTF-8
>> codec does on a wide build today should be good enough. Similarly for
>> encoding to UTF-8 on a wide build if one managed to create a string
>> containing a surrogate pair. Basically, I'm for a
>> garbage-in-garbage-out approach (with separate library functions to
>> detect garbage if the app is worried about it).
>
> If it's called UTF-8, there is no decision to be taken as to decoder
> behaviour - any byte sequence not permitted by the Unicode standard must
> result in an error (although, of course, *how* the error is to be reported
> could legitimately be the subject of endless discussion). There are
> security implications to violating the standard so this isn't just
> legalistic purity.
>
> Hmmm, doesn't look good:
>
> Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> '\xed\xb0\x80'.decode ('utf-8')
> u'\udc00'
> >>>
>
> Incorrect! Although this is a narrow build - I can't say what the wide
> build would do.

A wide Py2.7 build gives me the same result (u'\udc00'), but Py3 rejects it:

Python 3.1.2 (r312:79147, Sep 27 2010, 09:57:50)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> b'\xed\xb0\x80'.decode ('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: illegal encoding

Current Py3.3 and the PEP 393 build reject it as well (although both have a
better exception message now: "UnicodeDecodeError: 'utf8' codec can't decode
bytes in position 0-1: invalid continuation byte").
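
For anyone who wants to reproduce this on Py3 without reading tracebacks, here
is a minimal sketch. try_decode is a hypothetical helper name used only for
illustration; 'surrogatepass' is the standard codec error handler (available
since Py3.1) that opts back into the lenient behaviour seen in Py2 above.

```python
# The three bytes from the example above: U+DC00 (a lone low surrogate)
# written out in UTF-8 style, which the Unicode standard does not permit.
data = b'\xed\xb0\x80'


def try_decode(raw, errors='strict'):
    """Hypothetical helper: return the decoded str, or the exception raised."""
    try:
        return raw.decode('utf-8', errors)
    except UnicodeDecodeError as exc:
        return exc


# Default (strict) decoding refuses the sequence on Python 3.
strict_result = try_decode(data)

# 'surrogatepass' explicitly allows surrogate code points through.
lenient_result = try_decode(data, 'surrogatepass')

print(type(strict_result).__name__)
print(ascii(lenient_result))
```

So the strict decoder follows the standard by default, and the
garbage-in-garbage-out behaviour is still available, but only when asked for
explicitly.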

Stefan


