[Python-Dev] Dropping bytes "support" in json
p.f.moore at gmail.com
Fri Apr 10 13:53:47 CEST 2009
2009/4/10 Nick Coghlan <ncoghlan at gmail.com>:
> glyph at divmod.com wrote:
>> On 03:21 am, ncoghlan at gmail.com wrote:
>>> Given that json is a wire protocol, that sounds like the right approach
>>> for json as well. Once bytes-everywhere works, then a text API can be
>>> built on top of it, but it is difficult to build a bytes API on top of a
>>> text one.
>> I wish I could agree, but JSON isn't really a wire protocol. According
>> to http://www.ietf.org/rfc/rfc4627.txt JSON is "a text format for the
>> serialization of structured data". There are some notes about encoding,
>> but it is very clearly described in terms of unicode code points.
> Ah, my apologies - if the RFC defines things such that the native format
> is Unicode, then yes, the appropriate Python 3.x data type for the base
> implementation would indeed be strings.
Indeed, the RFC seems to clearly imply that loads should take a
Unicode string, dumps should produce one, and load/dump should work in
terms of text files (not byte files).
On the other hand, further down in the document:
    JSON text SHALL be encoded in Unicode. The default encoding is
    UTF-8.

    Since the first two characters of a JSON text will always be ASCII
    characters [RFC0020], it is possible to determine whether an octet
    stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking
    at the pattern of nulls in the first four octets.
This is at best confused (in my utterly non-expert opinion :-)) as
Unicode isn't an encoding...
I would guess that what the RFC is trying to say is that JSON is text
(Unicode) and where a byte stream purporting to be JSON is encountered
without a defined encoding, this is how to guess one.
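
For concreteness, here is a rough sketch of that guessing rule, assuming the null-pattern table given in section 3 of the RFC. The name guess_json_encoding is made up for illustration and is not part of the json module, and falling back to UTF-8 for very short input is my own assumption:

```python
import json


def guess_json_encoding(data: bytes) -> str:
    """Guess the encoding of a JSON octet stream from its first four octets.

    Hypothetical helper: follows the null-pattern table in RFC 4627,
    section 3, and assumes the stream carries no byte order mark.
    """
    if len(data) < 4:
        # Assumption: too short for the four-octet test, so use the default.
        return "utf-8"
    # True where the octet is a null, False otherwise.
    nulls = tuple(octet == 0 for octet in data[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Anything else (xx xx xx xx) is treated as UTF-8.
    return patterns.get(nulls, "utf-8")


# Usage: decode the bytes first, then hand text to the json module.
raw = '{"key": "value"}'.encode("utf-16-le")
text = raw.decode(guess_json_encoding(raw))
print(json.loads(text))
```

This only works because the RFC requires a JSON text to start with two ASCII characters; it would not be a safe guess for arbitrary text.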
That implies that loads can/should also allow bytes as input, applying
the given algorithm to guess an encoding. And similarly load
can/should accept a byte stream, on the same basis. (There's no need
to allow the possibility of accepting bytes plus an encoding - in that
case the user should decode the bytes before passing Unicode to the
JSON module.)
An alternative might be for the JSON module to register a special
encoding ('JSON-guess'?) which captures the rules here. Then there's
no need for special bytes parameter handling.
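
A sketch of what such a codec might look like, using codecs.register from the standard library. Everything named here (the 'json-guess' codec, the underscore-prefixed helpers) is hypothetical, and having the encoder emit UTF-8 is my assumption, based on UTF-8 being the RFC's default:

```python
import codecs
import json

# Null patterns of the first four octets, per RFC 4627 section 3.
_NULL_PATTERNS = {
    (True, True, True, False): "utf-32-be",
    (True, False, True, False): "utf-16-be",
    (False, True, True, True): "utf-32-le",
    (False, True, False, True): "utf-16-le",
}


def _guess(data: bytes) -> str:
    # Fall back to UTF-8 when no pattern matches or the input is short.
    if len(data) < 4:
        return "utf-8"
    return _NULL_PATTERNS.get(tuple(b == 0 for b in data[:4]), "utf-8")


def _decode(input, errors="strict"):
    data = bytes(input)  # bytes.decode() may hand the codec a memoryview
    return data.decode(_guess(data), errors), len(data)


def _encode(input, errors="strict"):
    # Assumption: encoding always produces the RFC's default, UTF-8.
    return input.encode("utf-8", errors), len(input)


def _search(name):
    # The lookup machinery lowercases the name; whether hyphens become
    # underscores depends on the Python version, so accept both.
    if name.replace("-", "_") == "json_guess":
        return codecs.CodecInfo(name="json-guess", encode=_encode, decode=_decode)
    return None


codecs.register(_search)

# Usage: the caller decodes with the special codec, then parses text.
payload = '[1, 2]'.encode("utf-16-be")
print(json.loads(payload.decode("json-guess")))
```

With something like this in place, loads and load could stay text-only and the guessing would live entirely in the codec.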
Of course, this is all from a native English speaker, who therefore
has no idea of the real life issues involved in Unicode :-)