[issue17909] Autodetecting JSON encoding

Sun May 5 15:10:30 CEST 2013

New submission from Serhiy Storchaka:

RFC 4627 specifies a method for determining the encoding (one of UTF-8, UTF-16(BE|LE) or UTF-32(BE|LE)) of encoded JSON text. The proposed preliminary patch (it doesn't include documentation yet) allows the load() and loads() functions to accept bytes data encoded with one of these standard Unicode encodings. Data with a BOM is also accepted (this isn't specified in RFC 4627, but is widely used).
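For illustration, the detection method from RFC 4627 (section 3) can be sketched roughly as follows. This is not the code from the attached patch: the names detect_json_encoding() and loads_bytes() are hypothetical, and the handling of inputs shorter than four bytes is a simplifying assumption.

```python
import codecs
import json


def detect_json_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of encoded JSON text (hypothetical helper)."""
    # BOM handling is an extension, not part of RFC 4627. UTF-32 must be
    # tested first because BOM_UTF32_LE begins with the bytes of BOM_UTF16_LE.
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"

    if len(data) < 4:
        # Simplification: too short for the four-octet test, assume UTF-8.
        return "utf-8"

    # RFC 4627, section 3: the first two characters are ASCII, so the
    # positions of the zero octets among the first four reveal the encoding.
    zeros = tuple(b == 0 for b in data[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    return patterns.get(zeros, "utf-8")


def loads_bytes(data: bytes):
    """Decode bytes with the detected encoding, then parse them as JSON."""
    return json.loads(data.decode(detect_json_encoding(data)))
```

With this sketch, loads_bytes('{"a": [1, 2]}'.encode("utf-16-be")) parses the same way as the str input would.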

There is only one case where the method can misfire. A serialized string "\x00..." encoded in UTF-16LE may be erroneously detected as UTF-32LE. This case violates two rules of RFC 4627: a string was serialized instead of an object or an array, and the control character U+0000 was not escaped. Standard-conforming encoded JSON is always detected correctly.
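The misfire can be seen by looking at the first four bytes directly. The snippet below is only a sketch: zero_pattern() is a hypothetical helper that marks which of the first four octets are zero, which is the information the RFC 4627 test relies on.

```python
def zero_pattern(data: bytes) -> tuple:
    """Return which of the first four octets are zero (hypothetical helper)."""
    return tuple(b == 0 for b in data[:4])


UTF32_LE = (False, True, True, True)    # xx 00 00 00
UTF16_LE = (False, True, False, True)   # xx 00 xx 00

# A bare string with an unescaped U+0000, encoded in UTF-16LE:
# the bytes start with 22 00 00 00, which looks like UTF-32LE.
ambiguous = '"\x00abc"'.encode("utf-16-le")
misdetected = zero_pattern(ambiguous) == UTF32_LE

# Escaping the control character restores the UTF-16LE pattern (22 00 5c 00).
escaped = '"\\u0000abc"'.encode("utf-16-le")
escaped_ok = zero_pattern(escaped) == UTF16_LE

# Wrapping the string in an array also does (5b 00 22 00).
wrapped = '["\x00abc"]'.encode("utf-16-le")
wrapped_ok = zero_pattern(wrapped) == UTF16_LE
```

Following either of the two RFC 4627 rules is enough to make the first two characters non-NUL ASCII, which is why conforming input is not affected.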

This patch requires the "surrogatepass" error handler for utf-16/32 (see issue12892 and issue13916).

----------
assignee: serhiy.storchaka
components: Library (Lib), Unicode
files: json_detect_encoding.patch
keywords: patch
messages: 188442
nosy: ezio.melotti, pitrou, rhettinger, serhiy.storchaka
priority: normal
severity: normal
status: open
title: Autodetecting JSON encoding
type: enhancement
versions: Python 3.4
Added file: http://bugs.python.org/file30133/json_detect_encoding.patch

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue17909>
_______________________________________