[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0

Marc-Andre Lemburg report at bugs.python.org
Thu Apr 1 09:44:50 CEST 2010

Marc-Andre Lemburg <mal at egenix.com> added the comment:

John Machin wrote:
> John Machin <sjmachin at users.sourceforge.net> added the comment:
> @lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below.

I just had a quick look at the code and saw that it's testing for the high
bit on the subsequent bytes.

Looking closer, you're right and the situation is a bit more complex,
but the solution still looks simple: only the endinpos
has to be adjusted more carefully depending on what the various
checks find.
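
As a rough illustration of what "adjusting endinpos" means in practice, the sketch below inspects the span that the decoder reports as invalid. It assumes Python 3, where the method is bytes.decode rather than str.decode, and it assumes that the start/end attributes of UnicodeDecodeError reflect the decoder's internal start and end positions. The helper name invalid_span is made up for this example.

```python
def invalid_span(data):
    """Return (start, end) of the first invalid span reported by the
    UTF-8 decoder, or None if the data decodes cleanly.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        # exc.start / exc.end delimit the bytes the decoder rejected;
        # an error handler such as 'replace' substitutes exactly this span.
        return exc.start, exc.end
    return None


# 0xF1 announces a 4-byte sequence, 0x80 is a valid continuation byte,
# but 'A' (0x41) is not, so the sequence breaks off after two bytes.
span = invalid_span(b'\xf1\x80AB')
print(span)
```

How far the reported end extends past the start byte is the point under discussion here: it decides how many bytes a single replacement character swallows.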

That said, I find the Unicode Consortium's solution a bit awkward.
In UTF-8 the first byte of a multi-byte sequence defines the number
of bytes that make up the sequence. If some of those bytes are invalid,
the whole sequence is invalid, and the fact that some of those
bytes may be interpretable as regular code points does not necessarily
give better results. The reason is that losing bytes in a
stream is far less likely than having a few bits flipped in the data.
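
To make the difference between the two policies concrete, here is a minimal sketch, assuming Python 3 (bytes.decode instead of str.decode). The functions expected_length and decode_whole_sequence are hypothetical and written only for this comparison: they implement the "the lead byte defines the length, so replace the whole sequence" idea described above. They are not the code in the CPython decoder.

```python
def expected_length(lead):
    """Number of bytes a UTF-8 lead byte announces, or 0 if the byte
    cannot start a sequence (hypothetical helper)."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_whole_sequence(data):
    """Decode UTF-8, replacing every invalid sequence *as a whole*
    (as many bytes as the lead byte announced) with one U+FFFD."""
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0:
            # Stray continuation byte or illegal lead byte: replace one byte.
            out.append('\ufffd')
            i += 1
            continue
        chunk = data[i:i + n]
        try:
            out.append(chunk.decode('utf-8'))
        except UnicodeDecodeError:
            # Any defect inside the announced sequence discards all n bytes.
            out.append('\ufffd')
        i += n
    return ''.join(out)


damaged = b'\xf1\x80AB'   # 4-byte lead, one continuation byte, then "AB"

# Whole-sequence policy: all four bytes collapse into one replacement.
print(ascii(decode_whole_sequence(damaged)))        # '\ufffd'

# Built-in decoder in current Python 3: only the broken prefix is
# replaced and decoding resumes at 'A'.
print(ascii(damaged.decode('utf-8', 'replace')))    # '\ufffdAB'
```

Which output is preferable depends on the assumed kind of damage: if bytes were dropped from the stream, "AB" is genuine data worth keeping; if bits were flipped inside a four-byte character, "AB" is debris from that character.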

title: str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0 -> str.decode('utf8',	'replace') -- conformance with Unicode 5.2.0

Python tracker <report at bugs.python.org>