[Python-Dev] Dropping bytes "support" in json

Fri Apr 10 17:55:25 CEST 2009

On Fri, Apr 10, 2009 at 8:38 AM, Stephen J. Turnbull <stephen at xemacs.org> wrote:
> Paul Moore writes:
>
>  > On the other hand, further down in the document:
>  >
>  > """
>  > 3.  Encoding
>  >
>  >    JSON text SHALL be encoded in Unicode.  The default encoding is
>  >    UTF-8.
>  >
>  >    Since the first two characters of a JSON text will always be ASCII
>  >    characters [RFC0020], it is possible to determine whether an octet
>  >    stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
>  >    at the pattern of nulls in the first four octets.
>  > """
>  >
>  > This is at best confused (in my utterly non-expert opinion :-)) as
>  > Unicode isn't an encoding...
>
> The word "encoding" (by itself) does not have a standard definition
> AFAIK.  However, since Unicode *is* a "coded character set" (plus a
> bunch of hairy usage rules), there's nothing wrong with saying "text
> is encoded in Unicode".  The RFC 2130 and Unicode TR#17 taxonomies are
> annoyingly verbose and pedantic to say the least.
>
> So what is being said there (in UTR#17 terminology) is
>
> (1) JSON is *text*, that is, a sequence of characters.
> (2) The abstract repertoire and coded character set are defined by the
>    Unicode standard.
> (3) The default transfer encoding syntax is UTF-8.
>
>  > That implies that loads can/should also allow bytes as input, applying
>  > the given algorithm to guess an encoding.
>
> It's not a guess, unless the data stream is corrupt---or nonconforming.
>
> But it should not be the JSON package's responsibility to deal with
> corruption or non-conformance (eg, ISO-8859-15-encoded programs).
> That's the whole point of specifying the coded character set in the
> standard in the first place.  I think it's a bad idea for any of the core
> JSON API to accept or produce bytes in any language that provides a
> Unicode string type.
>
> That doesn't mean Python's module shouldn't provide convenience
> functions to read and write JSON serialized as UTF-8 (in fact, that
> *should* be done, IMO) and/or other UTFs (I'm not so happy about
> that).  But those who write programs using them should not report bugs
> until they've checked out and eliminated the possibility of an
> encoding screwup!
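
For reference, the null-pattern check described in the quoted RFC text
can be sketched roughly as below. This is only an illustration of the
algorithm under discussion, written for Python 3; the function name is
made up for this example, and it assumes conforming input whose first
two characters are ASCII.

```python
def detect_json_encoding(data: bytes) -> str:
    """Pick a UTF from the pattern of nulls in the first four octets.

    Hypothetical helper, not part of the json module. It assumes the
    stream is conforming JSON text that starts with two ASCII characters.
    """
    if len(data) < 4:
        # Too short to show a four-octet pattern: use the RFC default.
        return "utf-8"
    # True where the octet is a null byte.
    nulls = tuple(octet == 0 for octet in data[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Any other pattern (normally no nulls at all) is treated as UTF-8.
    return patterns.get(nulls, "utf-8")
```

A corrupt or nonconforming stream can still match one of these
patterns by accident, which is why the check is only as reliable as
the input.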

The current implementation doesn't do any encoding guesswork, and I
have no intention of allowing that as a feature. The input must be
unicode or UTF-8 bytes; otherwise an encoding must be specified.
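
As a rough sketch of that contract in Python 3 terms (the helper name
and its signature are invented for this example and are not the
module's actual code), decoding without any guesswork could look like
this:

```python
import json


def loads_text_or_bytes(data, encoding="utf-8"):
    """Parse JSON from str, or from bytes in an explicitly known encoding.

    Hypothetical wrapper for illustration only.
    """
    if isinstance(data, (bytes, bytearray)):
        # No guessing: decode with the stated encoding (UTF-8 unless the
        # caller says otherwise). Mis-encoded input raises
        # UnicodeDecodeError instead of being silently misread.
        data = data.decode(encoding)
    return json.loads(data)
```

With this, text and UTF-8 bytes work as-is, and anything else has to
be named by the caller, for example
loads_text_or_bytes(raw, encoding="latin-1").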

Personally, most of my experience with JSON is as a wire protocol and
thus bytes, so the obvious function to encode JSON should do that. There
probably should be another function to get unicode output, but nobody
has ever asked for that in the Python 2.x version. They either want
the default behavior (encoding as ASCII str which can be used as
unicode due to implementation details of Python 2.x) or encoding as a
more compact UTF-8 str (without escaping non-ASCII code points).
Perhaps Python 3 users would ask for a unicode output when decoding
though.
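
To make the two output styles concrete, here is a small Python 3
sketch using json.dumps and its ensure_ascii flag. In Python 3 dumps
returns text, so the UTF-8 bytes come from an explicit encode step;
the sample data is made up.

```python
import json

data = {"name": "Zo\u00eb"}  # "Zoe" with a diaeresis on the e

# Default: non-ASCII code points are escaped, so the result is pure ASCII.
ascii_text = json.dumps(data)

# More compact: keep non-ASCII code points unescaped, then encode as UTF-8
# to get bytes suitable for the wire.
utf8_bytes = json.dumps(data, ensure_ascii=False).encode("utf-8")

print(ascii_text)
print(utf8_bytes)
```

Both forms decode back to the same object; the second is simply
shorter whenever the data contains non-ASCII characters.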

-bob