[New-bugs-announce] [issue17909] Autodetecting JSON encoding
report at bugs.python.org
Sun May 5 15:10:30 CEST 2013
New submission from Serhiy Storchaka:
RFC 4627 specifies a method to determine an encoding (one of UTF-8, UTF-16(BE|LE) or UTF-32(BE|LE)) of encoded JSON text. The proposed preliminary patch (it doesn't include the documentation yet) allows load() and loads() functions accept bytes data when it is encoded with standard Unicode encoding. Also accepted data with BOM (this doesn't specified in RFC 4627, but is widely used).
There is only one case where the method can give a misfire. Serialized string "\x00..." encoded in UTF-16LE may be erroneously detected as encoded in UTF-32LE. This case violates the two rules of RFC 4627: the string was serialized instead of a an object or an array, and the control character U+0000 was not escaped. The standard encoded JSON always detected correctly.
This patch requires "surrogatepass" error handler for utf-16/32 (see issue12892 and issue13916).
components: Library (Lib), Unicode
nosy: ezio.melotti, pitrou, rhettinger, serhiy.storchaka
title: Autodetecting JSON encoding
versions: Python 3.4
Added file: http://bugs.python.org/file30133/json_detect_encoding.patch
Python tracker <report at bugs.python.org>
More information about the New-bugs-announce