[Python-Dev] PEP 393 Summer of Code Project
Isaac Morland
ijmorlan at uwaterloo.ca
Fri Aug 26 04:28:06 CEST 2011
On Thu, 25 Aug 2011, Guido van Rossum wrote:
> I'm not sure what should happen with UTF-8 when it (in flagrant
> violation of the standard, I presume) contains two separately-encoded
> surrogates forming a valid surrogate pair; probably whatever the UTF-8
> codec does on a wide build today should be good enough. Similarly for
> encoding to UTF-8 on a wide build if one managed to create a string
> containing a surrogate pair. Basically, I'm for a
> garbage-in-garbage-out approach (with separate library functions to
> detect garbage if the app is worried about it).
If it's called UTF-8, there is no decision to be taken as to decoder
behaviour - any byte sequence not permitted by the Unicode standard must
result in an error (although, of course, *how* the error is to be reported
could legitimately be the subject of endless discussion). There are
security implications to violating the standard so this isn't just
legalistic purity.
Hmmm, doesn't look good:
Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\xed\xb0\x80'.decode ('utf-8')
u'\udc00'
>>>
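Incorrect! Those bytes encode U+DC00, a low surrogate, which RFC 3629
forbids in UTF-8, so the decoder should have raised an error. This is a
narrow build, though; I can't say what a wide build would do.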
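
For comparison, here is a minimal sketch of the same check, assuming a
Python 3 interpreter. There the strict UTF-8 codec is expected to reject an
encoded surrogate, and the lenient behaviour has to be requested explicitly
through the `surrogatepass` error handler. The helper name
`strict_utf8_rejects` is hypothetical and only serves the illustration.

```python
def strict_utf8_rejects(data: bytes) -> bool:
    """Return True if the strict UTF-8 codec refuses to decode `data`."""
    try:
        data.decode("utf-8")  # default error handler is "strict"
    except UnicodeDecodeError:
        return True
    return False


# The same three bytes as in the transcript above: U+DC00 encoded on its own.
lone_low_surrogate = b"\xed\xb0\x80"

print(strict_utf8_rejects(lone_low_surrogate))

# Opting in to the garbage-in-garbage-out behaviour is a separate, explicit step.
print(ascii(lone_low_surrogate.decode("utf-8", "surrogatepass")))
```

The point of the split is that the codec named "utf-8" stays conformant,
while callers who really want surrogates passed through have to say so.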
For reasons of practicality, it may be appropriate to provide easy access
to a CESU-8 decoder in addition to the normal UTF-8 decoder, but it must
not be called UTF-8. Other variations may also find use if provided.
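
As a rough sketch of what such a separately named decoder could look like,
assuming Python 3.4 or later: the function name `decode_cesu8` is
hypothetical, and this is not an existing standard-library codec. It relies
on the `surrogatepass` error handler to let three-byte encoded surrogates
through, then joins each high/low pair by round-tripping through UTF-16.

```python
def decode_cesu8(data: bytes) -> str:
    """Decode CESU-8 bytes to str (illustrative sketch, not a stdlib codec)."""
    # CESU-8 never uses four-byte sequences, so bytes 0xF0 and above
    # cannot appear in valid input.
    if any(b >= 0xF0 for b in data):
        raise ValueError("four-byte sequences are not valid CESU-8")

    # Step 1: decode, letting encoded surrogates through as lone
    # surrogate code points.
    with_surrogates = data.decode("utf-8", "surrogatepass")

    # Step 2: write the code points out as 16-bit units and decode them
    # strictly as UTF-16, which joins each high/low pair into one
    # character and raises UnicodeDecodeError on an unpaired surrogate.
    utf16 = with_surrogates.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le")


# U+10400 is the surrogate pair D801 DC00, each half encoded in three bytes.
print(ascii(decode_cesu8(b"\xed\xa0\x81\xed\xb0\x80")))
```

Keeping this under its own name leaves the decoder called UTF-8 free to
reject the same input outright.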
See UTF-8 RFC: http://www.ietf.org/rfc/rfc3629.txt
And CESU-8 technical report: http://www.unicode.org/reports/tr26/
Isaac Morland CSCF Web Guru
DC 2554C, x36650 WWW Software Specialist