[issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0
report at bugs.python.org
Thu Apr 1 16:43:24 CEST 2010
John Machin <sjmachin at users.sourceforge.net> added the comment:
Preamble: pardon my ignorance of how the codebase works, but trunk unicodeobject.c is r79494 (and allows encoding of surrogate codepoints), py3k unicodeobject.c is r79506 (and bans the surrogate caper) and I can't find the r79542 that the patch mentions ... help, please!
length 2 case:
1. the loop can be hand-unrolled into oblivion. It can be entered only when (s[1] & 0xC0) != 0x80 (previous if test).
2. the over-long check (if (ch < 0x80)) hasn't been touched. It could be removed and the entries for C0 and C1 in the utf8_code_length array set to 0.
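
A quick check of point 2, as a hedged sketch in Python 3 rather than the C of unicodeobject.c. It enumerates every 2-byte sequence with a valid continuation byte and shows that lead bytes C0 and C1 can only yield over-long values (ch < 0x80), while C2..DF never can, which is why zeroing the two table entries would make the check redundant. The helper names (is_continuation, assemble2) are made up for illustration and do not appear in the patch.

```python
def is_continuation(byte):
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def assemble2(lead, cont):
    """Combine a 2-byte sequence into a code point, with no validation."""
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


# All 64 valid continuation bytes, 0x80..0xBF.
CONTINUATIONS = [b for b in range(0x100) if is_continuation(b)]

# Lead bytes C0 and C1 always assemble to something below 0x80 (over-long).
c0_c1_always_overlong = all(
    assemble2(lead, cont) < 0x80
    for lead in (0xC0, 0xC1)
    for cont in CONTINUATIONS
)

# Lead bytes C2..DF never assemble to something below 0x80.
c2_df_never_overlong = all(
    assemble2(lead, cont) >= 0x80
    for lead in range(0xC2, 0xE0)
    for cont in CONTINUATIONS
)
```

Both flags come out true, so rejecting C0 and C1 as lead bytes covers exactly the cases the (ch < 0x80) test would catch.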
length 3 case:
1. the tests involving s[0] being 0xE0 or 0xED are misplaced.
2. the test s[0] == 0xE0 && s[1] < 0xA0 if not misplaced would be shadowing the over-long test (ch < 0x800). It seems better to use the over-long test (with endinpos set to 1).
3. The test s[0] == 0xED relates to the surrogates caper which in the py3k version is handled in the same place as the over-long test.
4. unrolling loop: needs no loop, only 1 test ... if s[1] is good, then we know s[2] must be bad without testing it, because we start the for loop only when s[1] is bad || s[2] is bad.
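
The "shadowing" in points 2 and 3 and the single test in point 4 can be checked with a small Python 3 sketch. It assumes both trailing bytes are valid continuation bytes for the first two checks, and it assumes the ED test in the patch is on s[1] >= 0xA0 (the message does not quote it). All names here are hypothetical and the code only models the logic, not the C source.

```python
def is_cont(byte):
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def assemble3(s0, s1, s2):
    """Combine a 3-byte sequence into a code point, with no validation."""
    return ((s0 & 0x0F) << 12) | ((s1 & 0x3F) << 6) | (s2 & 0x3F)


CONT = [b for b in range(0x100) if is_cont(b)]

# Point 2: with lead E0, "s[1] < 0xA0" and "ch < 0x800" agree on every input.
e0_test_matches_overlong = all(
    (s1 < 0xA0) == (assemble3(0xE0, s1, s2) < 0x800)
    for s1 in CONT
    for s2 in CONT
)

# Point 3: with lead ED, "s[1] >= 0xA0" agrees with the surrogate range test.
ed_test_matches_surrogates = all(
    (s1 >= 0xA0) == (0xD800 <= assemble3(0xED, s1, s2) <= 0xDFFF)
    for s1 in CONT
    for s2 in CONT
)


def bytes_before_first_bad(s1, s2):
    """Point 4: call only when s1 or s2 is not a continuation byte.

    If s1 is good, s2 must be the bad one, so one test is enough.
    The count includes the lead byte.
    """
    return 2 if is_cont(s1) else 1
```

For example, bytes_before_first_bad(0x80, 0x41) gives 2 and bytes_before_first_bad(0x41, 0x80) gives 1, with no loop involved.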
length 4 case: as for the len 3 case generally ... misplaced tests, F1 test shadows over-long test, F4 test shadows max value test, too many loop iterations.
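
The same kind of check works for 4-byte sequences, again as a Python 3 sketch with made-up names. It assumes the two lead-byte tests in question are the restricted second-byte ranges from the Unicode well-formedness table, namely lead F0 with s[1] < 0x90 and lead F4 with s[1] >= 0x90; the message does not quote the tests themselves.

```python
def is_cont(byte):
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def assemble4(s0, s1, s2, s3):
    """Combine a 4-byte sequence into a code point, with no validation."""
    return (
        ((s0 & 0x07) << 18)
        | ((s1 & 0x3F) << 12)
        | ((s2 & 0x3F) << 6)
        | (s3 & 0x3F)
    )


CONT = [b for b in range(0x100) if is_cont(b)]

# Lead F0: "s[1] < 0x90" agrees with the over-long test "ch < 0x10000".
f0_test_matches_overlong = all(
    (s1 < 0x90) == (assemble4(0xF0, s1, s2, s3) < 0x10000)
    for s1 in CONT
    for s2 in CONT
    for s3 in CONT
)

# Lead F4: "s[1] >= 0x90" agrees with the max value test "ch > 0x10FFFF".
f4_test_matches_max = all(
    (s1 >= 0x90) == (assemble4(0xF4, s1, s2, s3) > 0x10FFFF)
    for s1 in CONT
    for s2 in CONT
    for s3 in CONT
)
```

Both flags come out true, so under these assumptions each lead-byte test duplicates a check that the assembled value already allows.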
Python tracker <report at bugs.python.org>