[New-bugs-announce] [issue8271] str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0
John Machin
report at bugs.python.org
Wed Mar 31 04:28:12 CEST 2010
New submission from John Machin <sjmachin at users.sourceforge.net>:
Unicode 5.2.0 chapter 3 (Conformance) has a new section (headed "Constraints on Conversion Processes") after requirement D93. Recent Python versions, e.g. 3.1.2, do not comply. Using the Unicode example:
>>> print(ascii(b"\xc2\x41\x42".decode('utf8', 'replace')))
'\ufffdB'
# should produce '\ufffdAB'
Resynchronisation currently starts at a position derived by considering the length implied by the start byte:
>>> print(ascii(b"\xf1ABCD".decode('utf8', 'replace')))
'\ufffdD'
# should produce '\ufffdABCD'; resynchronisation should start from the *failing* byte.
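To make the requested behaviour concrete, here is a minimal sketch of a decoder that resumes at the failing byte. It assumes Python 3 bytes semantics (indexing yields an int). The names decode_utf8_replace, _sequence_length and _second_byte_range are hypothetical, and this is an illustration of the rule, not CPython's decoder.

```python
def _sequence_length(lead):
    """Length implied by a lead byte, or 0 if it can never start a sequence."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # 0x80..0xC1 and 0xF5..0xFF


def _second_byte_range(lead):
    """Allowed range for the second byte; it is narrower after some lead bytes."""
    if lead == 0xE0:
        return 0xA0, 0xBF
    if lead == 0xED:
        return 0x80, 0x9F
    if lead == 0xF0:
        return 0x90, 0xBF
    if lead == 0xF4:
        return 0x80, 0x8F
    return 0x80, 0xBF


def decode_utf8_replace(data):
    """Decode bytes as UTF-8, emitting U+FFFD and resuming at the failing byte."""
    out = []
    i, n = 0, len(data)
    while i < n:
        length = _sequence_length(data[i])
        if length == 0:
            out.append("\ufffd")  # byte cannot start a sequence
            i += 1
            continue
        j = i + 1
        lo, hi = _second_byte_range(data[i])
        ok = True
        while j < i + length:
            if j >= n or not (lo <= data[j] <= hi):
                ok = False  # data[j] (or end of input) is the failure point
                break
            lo, hi = 0x80, 0xBF  # later trail bytes use the plain range
            j += 1
        # A complete sequence is well formed, so strict decoding cannot fail.
        out.append(data[i:j].decode("utf-8") if ok else "\ufffd")
        i = j  # on failure, j is the failing byte, so it is re-examined
    return "".join(out)
```

With this sketch, decode_utf8_replace(b"\xf1ABCD") returns '\ufffdABCD': only the lone start byte is replaced, and the four ASCII bytes that follow are kept.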
Notes: This applies to the 'ignore' option as well as the 'replace' option. The Unicode discussion mentions "security exploits".
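The cases in this report could be turned into a small regression check along the following lines. This is only a sketch: check_resync and CASES are hypothetical names, and the expected values for 'ignore' are an assumption derived from the 'replace' cases by dropping U+FFFD.

```python
# Each entry: (input bytes, error handler, expected result).
# The 'replace' expectations come from the report; the 'ignore'
# expectations are assumed to be the same strings without U+FFFD.
CASES = [
    (b"\xc2\x41\x42", "replace", "\ufffdAB"),
    (b"\xf1ABCD", "replace", "\ufffdABCD"),
    (b"\xc2\x41\x42", "ignore", "AB"),
    (b"\xf1ABCD", "ignore", "ABCD"),
]


def check_resync(cases=CASES):
    """Return (data, errors, expected, actual) for every case that differs."""
    failures = []
    for data, errors, expected in cases:
        actual = data.decode("utf-8", errors)
        if actual != expected:
            failures.append((data, errors, expected, actual))
    return failures


if __name__ == "__main__":
    for data, errors, expected, actual in check_resync():
        print("%r with %r: expected %s, got %s"
              % (data, errors, ascii(expected), ascii(actual)))
```

An interpreter showing the behaviour described above would print one line per case; an empty result means the decoder resumes at the failing byte for these inputs.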
----------
messages: 101972
nosy: sjmachin
severity: normal
status: open
title: str.decode('utf8', 'replace') -- conformance with Unicode 5.2.0
type: behavior
versions: Python 2.7, Python 3.1
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue8271>
_______________________________________