[Python-Dev] Dropping bytes "support" in json
Terry Reedy
tjreedy at udel.edu
Fri Apr 10 23:05:17 CEST 2009
glyph at divmod.com wrote:
>
> On 03:21 am, ncoghlan at gmail.com wrote:
>> Barry Warsaw wrote:
>
>>> I don't know whether the parameter thing will work or not, but you're
>>> probably right that we need to get the bytes-everywhere API first.
>
>> Given that json is a wire protocol, that sounds like the right approach
>> for json as well. Once bytes-everywhere works, then a text API can be
>> built on top of it, but it is difficult to build a bytes API on top of a
>> text one.
>
> I wish I could agree, but JSON isn't really a wire protocol. According
> to http://www.ietf.org/rfc/rfc4627.txt JSON is "a text format for the
> serialization of structured data". There are some notes about encoding,
> but it is very clearly described in terms of unicode code points.
>> So I guess the IO library *is* the right model: bytes at the bottom of
>> the stack, with text as a wrapper around it (mediated by codecs).
>
> In email's case this is true, but in JSON's case it's not. JSON is a
> format defined as a sequence of code points; MIME is defined as a
> sequence of octets.

What is the 'bytes support' issue for json? Is it about content within
a json text? Or about the transport format of a json text?

Reading rfc4627, a json text is a unicode string representation of an
instance of one of 6 classes. In Python terms, they are NoneType, bool,
numbers (int, float, decimal?), (unicode) str, list, and [string-keyed]
dict. The representation is nearly identical to Python's literals and
displays.

For transport, the encoding SHALL be one of UTF-8, -16LE/BE, -32LE/BE,
with UTF-8 the 'default'.

So a json parser (a restricted eval()) tokenizes and parses a stream of
unicode chars which in Python could come from either a unicode string or
decoded bytes object. The bytes decoding could be either bulk or
incremental.
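
As a rough sketch of those input paths, assuming a current Python 3 and
the stdlib json and codecs modules (the sample text and the chunk size
of 5 are made up for illustration): the same json text is parsed from a
unicode string, from bulk-decoded bytes, and from incrementally decoded
bytes. Only the decoding is incremental here; json.loads still needs
the complete text.

```python
import codecs
import json

# Illustrative sample; the non-ASCII character makes the decoding step matter.
text = '{"name": "caf\u00e9", "tags": ["a", "b"], "ok": true, "n": null}'
data = text.encode("utf-8")

# 1. Parse a unicode string directly.
from_text = json.loads(text)

# 2. Decode the bytes in bulk, then parse the resulting string.
from_bulk = json.loads(data.decode("utf-8"))

# 3. Decode the bytes incrementally. The decoder buffers a multi-byte
#    character that is split across two chunks.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = []
for i in range(0, len(data), 5):
    pieces.append(decoder.decode(data[i:i + 5]))
pieces.append(decoder.decode(b"", final=True))  # flush any pending state
from_incremental = json.loads("".join(pieces))
```

All three paths hand the parser the same stream of unicode characters,
so they should produce equal Python objects.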

Similarly, a json generator (a repr()-like function) produces a stream
of unicode chars which again could be optionally encoded to bytes,
either incrementally or in bulk.
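
The output side might look like this, again only a sketch against the
stdlib of a current Python 3 (the sample object and the BytesIO sink
standing in for a socket or file are assumptions). json.dumps yields a
unicode string that can be encoded in bulk, while
JSONEncoder.iterencode yields string chunks that can be encoded one at
a time.

```python
import codecs
import io
import json

obj = {"name": "caf\u00e9", "values": [1, 2.5, None, True]}

# Bulk: generate the whole unicode text, then encode it once.
text = json.dumps(obj, ensure_ascii=False)
bulk = text.encode("utf-8")

# Incremental: encode each chunk of text as the generator produces it.
encoder = json.JSONEncoder(ensure_ascii=False)
incremental = codecs.getincrementalencoder("utf-8")()
sink = io.BytesIO()  # stand-in for a socket or binary file
for chunk in encoder.iterencode(obj):
    sink.write(incremental.encode(chunk))
sink.write(incremental.encode("", final=True))  # flush encoder state
```

Either way the generator itself only deals in unicode characters; the
choice of encoding is a separate, optional layer.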

The standard does not specify any correspondence between representations
and domain objects. For Python, making 'null', 'true', and 'false'
inter-convert with None, True, False is obvious. Numbers are slightly
more problematic. A generator could produce decimal literals from
both floats and decimals but without a non-json extension, a parser
could only convert back to one, so the other would not round-trip. (Int
could be handled by the presence or absence of '.0'.) Similarly, tuples
could be represented, like lists, as json square-bracketed arrays, but
they would be converted back to lists, not tuples, unless a non-json
extension were used.
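
A small sketch of these round-trip limits, assuming the stdlib json
module of a current Python 3 (the sample values are arbitrary). The
parse_float hook lets the parser pick Decimal instead of float, but it
can only pick one type for all non-integer literals.

```python
import json
from decimal import Decimal

# Tuples are written as json arrays and come back as lists.
restored_tuple = json.loads(json.dumps((1, 2, 3)))

# int and float are told apart by the presence or absence of '.0'.
number_text = json.dumps([1, 1.0])
types = [type(x) for x in json.loads(number_text)]

# The parser must choose a single target type for decimal literals.
as_float = json.loads("[1.10, 2]")
as_decimal = json.loads("[1.10, 2]", parse_float=Decimal)

# Decimal is not serializable by default; converting through float is
# one lossy option, so a Decimal does not round-trip without extra
# conventions.
lossy_text = json.dumps([Decimal("1.10")], default=float)
```

Here as_float holds a float and as_decimal holds a Decimal for the same
literal, which is the choice the parser cannot recover from the json
text alone.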

So the two possible bytes-support content issues I see are how to
represent them as legal json strings and/or whether some device should
be added to make them round-trip. But as indicated above, these two
issues are not unique to bytes.
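
To make both issues concrete, here is one possible approach, not
something the standard or the json module prescribes: represent bytes
as a base64 json string, wrapped in a tagged object so a parser can
convert it back. The "__bytes__" key and the two helper functions are
hypothetical names invented for this sketch; the default and
object_hook parameters are the stdlib's extension points.

```python
import base64
import json

def encode_bytes(obj):
    # Called by json.dumps for objects it cannot serialize itself.
    # "__bytes__" is a made-up tag, i.e. a non-json convention.
    if isinstance(obj, bytes):
        return {"__bytes__": base64.b64encode(obj).decode("ascii")}
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

def decode_bytes(dct):
    # Called by json.loads for every decoded json object (dict).
    if set(dct) == {"__bytes__"}:
        return base64.b64decode(dct["__bytes__"])
    return dct

payload = {"id": 7, "blob": b"\x00\xff raw"}
text = json.dumps(payload, default=encode_bytes)
restored = json.loads(text, object_hook=decode_bytes)
```

The json text itself stays legal json; only a reader that knows the
tagging convention gets bytes back, and any other reader sees an
ordinary object with one string member.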

Terry Jan Reedy