Paul Moore writes:
On the other hand, further down in the document:
JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.
This is at best confused (in my utterly non-expert opinion :-)) as Unicode isn't an encoding...
The word "encoding" (by itself) does not have a standard definition AFAIK. However, since Unicode is a "coded character set" (plus a bunch of hairy usage rules), there's nothing wrong with saying "text is encoded in Unicode". The RFC 2130 and Unicode TR#17 taxonomies are annoyingly verbose and pedantic, to say the least.
So what is being said there (in UTR#17 terminology) is
(1) JSON is text, that is, a sequence of characters.
(2) The abstract repertoire and coded character set are defined by the Unicode standard.
(3) The default transfer encoding syntax is UTF-8.
That implies that loads can/should also allow bytes as input, applying the given algorithm to guess an encoding.
It's not a guess, unless the data stream is corrupt, or nonconforming.
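The "pattern of nulls" check really is mechanical. A minimal sketch of what RFC 4627 describes (the function name detect_json_encoding is mine, and I ignore the possibility of a BOM for simplicity):

    def detect_json_encoding(data):
        # Deterministic detection per RFC 4627, section 3: a JSON text
        # begins with two ASCII characters, so the pattern of NUL octets
        # in the first four bytes identifies which UTF is in use.
        if len(data) < 4:
            # A complete JSON text shorter than four octets can only be
            # UTF-8: two characters need at least four octets in UTF-16
            # and eight in UTF-32.
            return "utf-8"
        nulls = tuple(b == 0 for b in data[:4])
        if nulls == (True, True, True, False):
            return "utf-32-be"   # 00 00 00 xx
        if nulls == (False, True, True, True):
            return "utf-32-le"   # xx 00 00 00
        if nulls == (True, False, True, False):
            return "utf-16-be"   # 00 xx 00 xx
        if nulls == (False, True, False, True):
            return "utf-16-le"   # xx 00 xx 00
        return "utf-8"           # xx xx xx xx

Any other pattern of nulls means the stream is corrupt or nonconforming, which is exactly the case I don't think the JSON package should be in the business of handling.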
But it should not be the JSON package's responsibility to deal with corruption or non-conformance (e.g., ISO-8859-15-encoded programs). That's the whole point of specifying the coded character set in the standard in the first place. I think it's a bad idea for any of the core JSON API to accept or produce bytes in any language that provides a Unicode string type.
That doesn't mean Python's module shouldn't provide convenience functions to read and write JSON serialized as UTF-8 (in fact, that should be done, IMO) and/or other UTFs (I'm not so happy about that). But those who write programs using them should not report bugs until they've checked out and eliminated the possibility of an encoding screwup!
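Concretely, the convenience functions I have in mind would look something like this (the names load_utf8 and dump_utf8 are made up for illustration; the point is that the bytes/str boundary is explicit):

    import json

    def load_utf8(raw):
        # Decode explicitly, so an encoding screwup surfaces as a
        # UnicodeDecodeError right here, rather than as a mysterious
        # bug report against the json module.
        return json.loads(raw.decode("utf-8"))

    def dump_utf8(obj):
        # Serialize to str first, then encode; the UTF-8 step is
        # visible and deliberate rather than hidden in the library.
        return json.dumps(obj, ensure_ascii=False).encode("utf-8")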