[Python-Dev] Dropping bytes "support" in json

Fri Apr 10 17:38:03 CEST 2009

Paul Moore writes:

 > On the other hand, further down in the document:
 > 
 > """
 > 3.  Encoding
 > 
 >    JSON text SHALL be encoded in Unicode.  The default encoding is
 >    UTF-8.
 > 
 >    Since the first two characters of a JSON text will always be ASCII
 >    characters [RFC0020], it is possible to determine whether an octet
 >    stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
 >    at the pattern of nulls in the first four octets.
 > """
 > 
 > This is at best confused (in my utterly non-expert opinion :-)) as
 > Unicode isn't an encoding...

The word "encoding" (by itself) does not have a standard definition
AFAIK.  However, since Unicode *is* a "coded character set" (plus a
bunch of hairy usage rules), there's nothing wrong with saying "text
is encoded in Unicode".  The RFC 2130 and Unicode TR#17 taxonomies are
annoyingly verbose and pedantic, to say the least.

So what is being said there (in UTR#17 terminology) is

(1) JSON is *text*, that is, a sequence of characters.
(2) The abstract repertoire and coded character set are defined by the
    Unicode standard.
(3) The default character encoding scheme is UTF-8.
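
To make the three layers concrete, here is a small Python 3 sketch.  The
sample text and the variable names are mine, purely for illustration.

```python
# (1) Text: a sequence of characters (a Python 3 str).
text = '{"name": "\u00e9"}'

# (2) Coded character set: each character maps to a Unicode code point.
code_points = [ord(ch) for ch in text]

# (3) Serialization: the code points are written out as UTF-8 octets.
octets = text.encode("utf-8")

# The one non-ASCII character is a single code point but two octets.
print(len(text), len(octets))
```

Only the last step involves bytes; the first two are properties of the
text itself.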

 > That implies that loads can/should also allow bytes as input, applying
 > the given algorithm to guess an encoding.

It's not a guess, unless the data stream is corrupt---or nonconforming.
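
For a conforming stream the null-pattern test quoted above is a table
lookup.  A minimal sketch, assuming Python 3; the function name
detect_json_utf is hypothetical and not part of the json module:

```python
def detect_json_utf(data: bytes) -> str:
    """Name the UTF implied by the nulls in the first four octets."""
    if len(data) < 4:
        # A conforming text this short ("[]", "{}", "[1]") can only be
        # UTF-8; the shortest UTF-16 text is already four octets.
        return "utf-8"
    nulls = tuple(octet == 0 for octet in data[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",    # 00 00 00 xx
        (True, False, True, False): "utf-16-be",   # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",    # xx 00 00 00
        (False, True, False, True): "utf-16-le",   # xx 00 xx 00
        (False, False, False, False): "utf-8",     # xx xx xx xx
    }
    if nulls not in patterns:
        # Any other pattern means the stream is corrupt or nonconforming.
        raise ValueError("octet stream is not conforming JSON text")
    return patterns[nulls]
```

Note that the lookup only distinguishes the UTFs from each other.  A
stream in some other encoding with no nulls up front will be reported as
UTF-8, and the problem surfaces later, when decoding fails.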

But it should not be the JSON package's responsibility to deal with
corruption or non-conformance (e.g., ISO-8859-15-encoded programs).
That's the whole point of specifying the coded character set in the
standard in the first place.  I think it's a bad idea for any of the core
JSON API to accept or produce bytes in any language that provides a
Unicode string type.

That doesn't mean Python's module shouldn't provide convenience
functions to read and write JSON serialized as UTF-8 (in fact, that
*should* be done, IMO) and/or other UTFs (I'm not so happy about
that).  But those who write programs using them should not report bugs
until they've checked out and eliminated the possibility of an
encoding screwup!
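
As a rough sketch of what I mean by convenience functions (the names
loads_utf8 and dumps_utf8 are hypothetical, and I'm assuming a Python 3
json module whose core API deals in str):

```python
import json


def loads_utf8(data: bytes):
    """Decode UTF-8 octets strictly, then parse the resulting text."""
    # Strict decoding raises UnicodeDecodeError on a nonconforming stream
    # instead of silently guessing at some other encoding.
    return json.loads(data.decode("utf-8"))


def dumps_utf8(obj) -> bytes:
    """Serialize to JSON text, then encode that text as UTF-8 octets."""
    # ensure_ascii=False keeps non-ASCII characters as themselves rather
    # than as \uXXXX escapes, so the UTF-8 encoding step does real work.
    return json.dumps(obj, ensure_ascii=False).encode("utf-8")
```

The core stays text-only; the bytes handling is a thin layer on top, and
feeding it ISO-8859-15 data fails in the decode step, not in the parser.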