[issue9133] Invalid UTF8 Byte sequence not raising exception/being substituted
report at bugs.python.org
Wed Jun 30 22:02:54 CEST 2010
New submission from Mike Lewis <mikelikespie at gmail.com>:
When I do
codecs.encode(codecs.decode('\xed\xbc\xad', 'utf8'), 'utf8')
its not throwing an exception. '\xed\xbc\xad' is an invalid UTF8 byte sequence.
It maps to the value U+DF2D which is a "surrogate pair" it seems.
However, pairs of
UCS-2 values between D800 and DFFF (surrogate pairs in Unicode
parlance), being actually UCS-4 characters transformed through
UTF-16, need special treatment: the UTF-16 transformation must be
undone, yielding a UCS-4 character that is then transformed as
which would suggest that it is invalid.
However, I think wikipedia's explanation is a bit clearer:
UTF-8 may only legally be used to encode valid Unicode scalar values. According to the Unicode standard the high and low surrogate halves used by UTF-16 (U+D800 through U+DFFF) and values above U+10FFFF are not legal Unicode values, and the UTF-8 encoding of them is an invalid byte sequence and should be treated as described above.
title: Invalid UTF8 Byte sequence not raising exception/being substituted
versions: Python 2.6
Python tracker <report at bugs.python.org>
More information about the Python-bugs-list