[Python-Dev] Dropping bytes "support" in json

Fri Apr 10 23:05:17 CEST 2009

glyph at divmod.com wrote:
> 
> On 03:21 am, ncoghlan at gmail.com wrote:
>> Barry Warsaw wrote:
> 
>>> I don't know whether the parameter thing will work or not, but you're
>>> probably right that we need to get the bytes-everywhere API first.
> 
>> Given that json is a wire protocol, that sounds like the right approach
>> for json as well. Once bytes-everywhere works, then a text API can be
>> built on top of it, but it is difficult to build a bytes API on top of a
>> text one.
> 
> I wish I could agree, but JSON isn't really a wire protocol.  According 
> to http://www.ietf.org/rfc/rfc4627.txt JSON is "a text format for the 
> serialization of structured data".  There are some notes about encoding, 
> but it is very clearly described in terms of unicode code points.
> 
>> So I guess the IO library *is* the right model: bytes at the bottom of
>> the stack, with text as a wrapper around it (mediated by codecs).
> 
> In email's case this is true, but in JSON's case it's not.  JSON is a 
> format defined as a sequence of code points; MIME is defined as a 
> sequence of octets.

What is the 'bytes support' issue for json?  Is it about content within 
a json text? Or about the transport format of a json text?

Reading RFC 4627, a json text is a unicode string representation of an 
instance of one of 6 classes.  In Python terms, they are NoneType, bool, 
numbers (int, float, decimal?), (unicode) str, list, and [string-keyed] 
dict.  The representation is nearly identical to Python's literals and 
displays.
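
As a rough illustration of that correspondence, the sketch below parses one json array holding a value of each kind and records the Python type the standard json module returns for it. The sample text is made up for the example.

```python
import json

# One value of each json kind: null, boolean, two numbers, string,
# array, and object (sample text invented for this example).
text = '[null, true, 1, 1.5, "abc", [1, 2], {"k": "v"}]'

values = json.loads(text)

# Names of the Python classes the parser produced, in order.
type_names = [type(v).__name__ for v in values]
print(type_names)
```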

For transport, the encoding SHALL be one of UTF-8, -16LE/BE, -32LE/BE, 
with UTF-8 the 'default'.
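
RFC 4627 (section 3) notes that, because a json text starts with two ASCII characters, the encoding can be guessed from the pattern of null octets in the first four octets. A minimal sketch of that check follows; the function name `detect_json_encoding` and the fallback to UTF-8 for very short input are assumptions made for this example, not part of any library.

```python
def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding of a json text from its first four octets.

    Hypothetical helper following the null-octet table in RFC 4627,
    section 3. It assumes the input carries no byte order mark.
    """
    head = data[:4]
    if len(head) < 4:
        # Assumption: anything shorter than four octets is treated as UTF-8.
        return "utf-8"
    nulls = tuple(octet == 0 for octet in head)
    if nulls == (True, True, True, False):    # 00 00 00 xx
        return "utf-32-be"
    if nulls == (True, False, True, False):   # 00 xx 00 xx
        return "utf-16-be"
    if nulls == (False, True, True, True):    # xx 00 00 00
        return "utf-32-le"
    if nulls == (False, True, False, True):   # xx 00 xx 00
        return "utf-16-le"
    return "utf-8"                            # xx xx xx xx
```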

So a json parser (a restricted eval()) tokenizes and parses a stream of 
unicode chars, which in Python could come from either a unicode string or 
a decoded bytes object.  The bytes decoding could be either bulk or 
incremental.
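
The two decoding styles could look roughly like this. The sketch uses the standard codecs module; the sample text and the three-octet chunk size are arbitrary choices for the example, picked so that one chunk boundary falls inside a multi-byte character.

```python
import codecs
import json

# Sample json text containing a non-ASCII character, as UTF-8 bytes.
raw = '["caf\u00e9", 1]'.encode("utf-8")

# Bulk: decode the whole bytes object in one call.
bulk_text = raw.decode("utf-8")

# Incremental: feed small chunks to a stateful decoder, which buffers
# any multi-byte sequence that is split across a chunk boundary.
decoder = codecs.getincrementaldecoder("utf-8")()
chunks = [raw[i:i + 3] for i in range(0, len(raw), 3)]
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b"", final=True))  # flush any buffered state
incremental_text = "".join(pieces)

# Either way the parser sees the same stream of unicode chars.
parsed = json.loads(incremental_text)
```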

Similarly, a json generator (a repr()-like function) produces a stream 
of unicode chars which again could be optionally encoded to bytes, 
either incrementally or in bulk.
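
On the generating side, a comparable sketch: json.dumps produces the text in one piece for bulk encoding, while JSONEncoder.iterencode yields it in pieces that can be encoded as they arrive. The sample data is invented for the example.

```python
import codecs
import json

data = {"name": "caf\u00e9", "items": [1, 2, 3]}

# Bulk: generate the full unicode text, then encode it in one call.
bulk_bytes = json.dumps(data, ensure_ascii=False).encode("utf-8")

# Incremental: encode each piece of text as the generator yields it.
encoder = codecs.getincrementalencoder("utf-8")()
text_pieces = json.JSONEncoder(ensure_ascii=False).iterencode(data)
byte_chunks = [encoder.encode(piece) for piece in text_pieces]
byte_chunks.append(encoder.encode("", final=True))  # flush encoder state
incremental_bytes = b"".join(byte_chunks)
```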

The standard does not specify any correspondence between representations 
and domain objects.  For Python, making 'null', 'true', and 'false' 
inter-convert with None, True, False is obvious.  Numbers are slightly 
more problematic.  A generator could produce decimal literals from 
both floats and decimals but without a non-json extension, a parser 
could only convert back to one, so the other would not round-trip. (Int 
could be handled by the presence or absence of '.0'.)  Similarly, tuples 
could be represented, like lists, as json square-bracketed arrays, but 
they would be converted back to lists, not tuples, unless a non-json 
extension were used.
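
A short sketch of these round-trip gaps, using the standard json module as it exists today. The parse_float hook is used only to show that the parser has to pick a single target type for decimal literals.

```python
import json
from decimal import Decimal

# Tuples are written as json arrays, so they come back as lists.
tuple_back = json.loads(json.dumps((1, 2)))

# Ints and floats stay distinct because a float is written with '.0'.
int_back = json.loads(json.dumps(1))      # text is '1'
float_back = json.loads(json.dumps(1.0))  # text is '1.0'

# The same decimal literal can map to float or Decimal, but the parser
# must choose one, so the other type cannot round-trip unaided.
as_float = json.loads("1.5")
as_decimal = json.loads("1.5", parse_float=Decimal)
```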

So the two possible bytes-support content issues I see are how to 
represent bytes as legal json strings and/or whether some device should 
be added to make them round-trip.  But as indicated above, these two 
issues are not unique to bytes.
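
For illustration only, one possible non-json device for bytes is sketched below: the bytes are written as a base64 string inside a tagged object, and a matching hook reverses the mapping on the way in. The `__bytes__` tag, the helper names, and the choice of base64 are all assumptions made for this example; nothing in the standard or in the json module prescribes them.

```python
import base64
import json

BYTES_TAG = "__bytes__"  # hypothetical marker key, not part of json


def encode_bytes(obj):
    """default= hook: represent bytes as a tagged base64 string."""
    if isinstance(obj, (bytes, bytearray)):
        return {BYTES_TAG: base64.b64encode(bytes(obj)).decode("ascii")}
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


def decode_bytes(dct):
    """object_hook= hook: turn a tagged object back into bytes."""
    if set(dct) == {BYTES_TAG}:
        return base64.b64decode(dct[BYTES_TAG])
    return dct


payload = {"blob": b"\x00\xff", "label": "raw"}
text = json.dumps(payload, default=encode_bytes)
restored = json.loads(text, object_hook=decode_bytes)
```

The same tagging approach would apply to tuples or decimals, which is the sense in which the round-trip question is not specific to bytes.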

Terry Jan Reedy