[Python-Dev] PEP 393 Summer of Code Project

Fri Aug 26 04:28:06 CEST 2011

On Thu, 25 Aug 2011, Guido van Rossum wrote:

> I'm not sure what should happen with UTF-8 when it (in flagrant
> violation of the standard, I presume) contains two separately-encoded
> surrogates forming a valid surrogate pair; probably whatever the UTF-8
> codec does on a wide build today should be good enough. Similarly for
> encoding to UTF-8 on a wide build if one managed to create a string
> containing a surrogate pair. Basically, I'm for a
> garbage-in-garbage-out approach (with separate library functions to
> detect garbage if the app is worried about it).

If it's called UTF-8, there is no decision to be taken as to decoder
behaviour: any byte sequence not permitted by the Unicode standard must
result in an error (although, of course, *how* the error is to be reported
could legitimately be the subject of endless discussion).  Violating the
standard has security implications, so this isn't just legalistic purity.

Hmmm, doesn't look good:

Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\xed\xb0\x80'.decode ('utf-8')
u'\udc00'
>>>

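Incorrect!  The bytes ED B0 80 are the three-byte encoding of U+DC00, a
low surrogate, which RFC 3629 prohibits, so the decoder should have raised
an error.  This is a narrow build, though; I can't say what the wide build
would do.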
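
For comparison, here is a minimal sketch of the behaviour a conforming
decoder is expected to show.  It assumes a Python 3 interpreter rather than
the 2.6.1 build above; the helper name decode_strict is hypothetical, and
'surrogatepass' is the standard Python 3 error handler that opts back in to
the lenient behaviour.

```python
def decode_strict(data):
    """Decode UTF-8 strictly; return None if the bytes are not valid UTF-8."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return None


# Three-byte form of U+DC00, a low surrogate (not permitted in UTF-8).
low_surrogate = b'\xed\xb0\x80'

# Strict decoding rejects the sequence.
print(decode_strict(low_surrogate))                           # None

# The lenient result is still available, but only on explicit request.
print(ascii(low_surrogate.decode('utf-8', 'surrogatepass')))  # '\udc00'
```

The point of the sketch is that garbage-in-garbage-out remains possible,
but under a name other than plain 'utf-8' decoding.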

For reasons of practicality, it may be appropriate to provide easy access
to a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must
not be called UTF-8.  Other variants may also be useful if provided.
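
As a rough illustration of what such a separately named decoder could look
like, here is a sketch assuming Python 3.  The function name decode_cesu8
is hypothetical (Python ships no codec by that name, as far as I know), and
the two-step approach through UTF-16 is just one convenient way to join
surrogate pairs, not a reference implementation of TR26.

```python
def decode_cesu8(data):
    """Decode CESU-8 bytes to str (sketch; see the assumptions above)."""
    # CESU-8 never uses four-byte sequences, so bytes 0xF0 and up are invalid.
    if any(b >= 0xF0 for b in data):
        raise ValueError('four-byte sequence is not valid CESU-8')
    # Step 1: decode as UTF-8, letting three-byte surrogate encodings through.
    with_surrogates = data.decode('utf-8', 'surrogatepass')
    # Step 2: write the code units out as UTF-16 and read them back strictly.
    # This joins valid pairs and raises UnicodeDecodeError on unpaired ones.
    utf16 = with_surrogates.encode('utf-16-le', 'surrogatepass')
    return utf16.decode('utf-16-le')


# U+1F600 as CESU-8: surrogates D83D and DE00, each in three-byte form.
print(ascii(decode_cesu8(b'\xed\xa0\xbd\xed\xb8\x80')))  # '\U0001f600'
```

An unpaired surrogate such as b'\xed\xb0\x80' still raises an error here,
so the decoder is lenient only in the one way CESU-8 defines.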

See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt

And CESU-8 technical report: http://www.unicode.org/reports/tr26/

Isaac Morland			CSCF Web Guru
DC 2554C, x36650		WWW Software Specialist