iterator/stream-based JSON API

When will the stdlib get a decent iterator/stream-based JSON API? For example automated packaging tools may be parsing a lot of JSON but ignoring most of it, and it would be lovely to say "if key not in interesting: skip_without_parsing()". Or to go straight from parse to domain objects without putting the whole thing in an intermediate dict. http://code.google.com/p/jsonpull/ is a nice-looking but woefully undocumented Java one, and https://crate.io/packages/ijson/ is a Python one with a quite different API.

From: Daniel Holth <dholth@gmail.com> Sent: Sunday, June 2, 2013 7:11 PM
When will the stdlib get a decent iterator/stream-based JSON API? For example automated packaging tools may be parsing a lot of JSON but ignoring most of it, and it would be lovely to say "if key not in interesting: skip_without_parsing()".
This sounds very interesting, but I'm not sure I'm thinking of the same thing you are. First, are you looking for a SAX-style API? Or something more like fts (http://man7.org/linux/man-pages/man3/fts.3.html) but with a nicer (iterator-plus-methods) syntax? Or… can you just post a complete simple example of what you'd like it to look like?
Or to go straight from parse to domain objects without putting the whole thing in an intermediate dict.
What does this part mean? You mean you want an opaque object with a DOM API, or an XPath-style accessor, instead of a native Python object? If so, why? Do you think having a dict hidden inside an opaque object, or using some other hash table implementation, is going to save space over just having a dict? Or would you prefer to write doc.find_element("my_key").find_element("my_other_key").find_element(3) or doc.find("//my_key/my_other_key[3]") instead of doc["my_key"]["my_other_key"][3]?

On Jun 2, 2013, at 7:11 PM, Daniel Holth <dholth@gmail.com> wrote:
When will the stdlib get a decent iterator/stream-based JSON API?
You could probably write one right now using the object hooks, but I'm not sure it would be useful. If JSON data were too big to fit into memory (measured in gigabytes), it is going to have a host of other issues. Also, it might not work well with JSON where the outermost object is a dictionary in arbitrary order, meaning that one would potentially have to read the whole stream to find the first key. Raymond
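As a rough illustration of the object-hook approach mentioned above, the sketch below uses the `object_pairs_hook` parameter of `json.loads` to drop uninteresting keys while each object is being built. The `INTERESTING` set and the `keep_interesting` helper are hypothetical names chosen for this example. The decoder still parses every value before the hook sees it, so this only trims the result; it does not skip any parsing work.

```python
import json

# Hypothetical set of keys the caller cares about.
INTERESTING = {"name", "version"}

def keep_interesting(pairs):
    """Build a dict from (key, value) pairs, dropping keys we do not need."""
    return {key: value for key, value in pairs if key in INTERESTING}

text = '{"name": "demo", "version": "1.0", "description": "long text", "files": [1, 2, 3]}'
result = json.loads(text, object_pairs_hook=keep_interesting)
print(result)  # {'name': 'demo', 'version': '1.0'}
```

The hook runs for every object at every nesting level and has no idea how deep it is, so a nested object stored under an uninteresting key is built first and then thrown away.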

On Sun, 2 Jun 2013 22:11:23 -0400 Daniel Holth <dholth@gmail.com> wrote:
When will the stdlib get a decent iterator/stream-based JSON API? For example automated packaging tools may be parsing a lot of JSON but ignoring most of it, and it would be lovely to say "if key not in interesting: skip_without_parsing()".
Do you think that would make a significant difference? I'm not sure what "parsing a lot of JSON" means, but I suppose packaging metadata is usually quite small. Regards Antoine.

On Mon, Jun 3, 2013 at 5:12 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 2 Jun 2013 22:11:23 -0400 Daniel Holth <dholth@gmail.com> wrote:
When will the stdlib get a decent iterator/stream-based JSON API? For example automated packaging tools may be parsing a lot of JSON but ignoring most of it, and it would be lovely to say "if key not in interesting: skip_without_parsing()".
No offense meant. The existing JSON API is quite good.
Do you think that would make a significant difference? I'm not sure what "parsing a lot of JSON" means, but I suppose packaging metadata is usually quite small.
I don't know whether it would matter for packaging but it would be very useful sometimes.
The jsonpull API looks like: http://code.google.com/p/jsonpull/source/browse/trunk/Example.java
A bit like:
    parser = Json(text)
    parser.eat('{')  # expect an object
    for element in parser.objectElements():
        parser.eat(Json.KEY)
        key = parser.getString()
        if key == "name":
            name = parser.getStringValue()
        elif key == "contact":
You can ask it what the next token is, seek ahead (never behind) to a named key in an object, or iterate over all the keys in an object without necessarily iterating over child objects. Once you get to an interesting sub-object you can get an iterator for that sub-object and perhaps pass it to a child constructor.

On Mon, Jun 3, 2013 at 8:37 AM, Daniel Holth <dholth@gmail.com> wrote:
On Mon, Jun 3, 2013 at 5:12 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 2 Jun 2013 22:11:23 -0400 Daniel Holth <dholth@gmail.com> wrote:
When will the stdlib get a decent iterator/stream-based JSON API? For example automated packaging tools may be parsing a lot of JSON but ignoring most of it, and it would be lovely to say "if key not in interesting: skip_without_parsing()".
No offense meant. The existing JSON API is quite good.
Do you think that would make a significant difference? I'm not sure what "parsing a lot of JSON" means, but I suppose packaging metadata is usually quite small.
I don't know whether it would matter for packaging but it would be very useful sometimes.
The jsonpull API looks like: http://code.google.com/p/jsonpull/source/browse/trunk/Example.java
A bit like:
    parser = Json(text)
    parser.eat('{')  # expect an object
    for element in parser.objectElements():
        parser.eat(Json.KEY)
        key = parser.getString()
        if key == "name":
            name = parser.getStringValue()
        elif key == "contact":
You can ask it what the next token is, seek ahead (never behind) to a named key in an object, or iterate over all the keys in an object without necessarily iterating over child objects. Once you get to an interesting sub-object you can get an iterator for that sub-object and perhaps pass it to a child constructor.
The ijson API yields a stream of events containing the full path to each item in the parsed JSON, an event name like "start_map", "end_map", "start_array", ...
    list(ijson.parse(StringIO.StringIO("""{ "a": { "b": "c" } }""")))
    [('', 'start_map', None), ('', 'map_key', 'a'), ('a', 'start_map', None), ('a', 'map_key', 'b'), ('a.b', 'string', u'c'), ('a', 'end_map', None), ('', 'end_map', None)]
It also has a higher-level API yielding only the objects under a certain prefix. Pass "a.b" and you would get only "c".
Besides memory, this kind of thing makes it much easier to know which level of the JSON structure you are in compared to the existing object_pairs hook.
I kind of like the pull API because you can "choose your own adventure", deciding whether to do higher or lower level parsing depending on where in the JSON you are. But you could easily get lost and do things that aren't permitted based on the parser state.
Daniel Holth
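To make the event shape above concrete, here is a minimal stdlib-only sketch that produces the same kind of (prefix, event, value) tuples. It is not ijson and not a streaming parser: it walks a value that `json.loads` has already built, purely to show how the prefixes are formed. The function names `events` and `scalars_under` are hypothetical, and the event names beyond those quoted in the message (including the "item" prefix segment for array elements) are assumptions about ijson's conventions.

```python
import json

def events(obj, prefix=""):
    """Yield (prefix, event, value) tuples for an already-parsed JSON value."""
    if isinstance(obj, dict):
        yield (prefix, "start_map", None)
        for key, value in obj.items():
            yield (prefix, "map_key", key)
            child = f"{prefix}.{key}" if prefix else key
            yield from events(value, child)
        yield (prefix, "end_map", None)
    elif isinstance(obj, list):
        yield (prefix, "start_array", None)
        # Assumed convention: array elements share an "item" path segment.
        child = f"{prefix}.item" if prefix else "item"
        for value in obj:
            yield from events(value, child)
        yield (prefix, "end_array", None)
    elif isinstance(obj, str):
        yield (prefix, "string", obj)
    elif obj is None:
        yield (prefix, "null", None)
    elif isinstance(obj, bool):
        yield (prefix, "boolean", obj)
    else:
        yield (prefix, "number", obj)

SCALAR_EVENTS = {"string", "number", "boolean", "null"}

def scalars_under(obj, wanted_prefix):
    """Yield only the scalar values whose path equals wanted_prefix."""
    for prefix, event, value in events(obj):
        if prefix == wanted_prefix and event in SCALAR_EVENTS:
            yield value

doc = json.loads('{ "a": { "b": "c" } }')
for event in events(doc):
    print(event)
print(list(scalars_under(doc, "a.b")))  # ['c']
```

A real event-based parser would emit these tuples while reading the input, without ever holding the whole document; the sketch only reproduces the output format.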

On Mon, 3 Jun 2013 09:01:43 -0400 Daniel Holth <dholth@gmail.com> wrote:
The ijson API yields a stream of events containing the full path to each item in the parsed JSON, an event name like "start_map", "end_map", "start_array", ...
list(ijson.parse(StringIO.StringIO("""{ "a": { "b": "c" } }""")))
[('', 'start_map', None), ('', 'map_key', 'a'), ('a', 'start_map', None), ('a', 'map_key', 'b'), ('a.b', 'string', u'c'), ('a', 'end_map', None), ('', 'end_map', None)]
It also has a higher-level API yielding only the objects under a certain prefix. Pass "a.b" and you would get only "c".
Besides memory this kind of thing makes it much easier to know which level of the JSON structure you are in compared to the existing object_pairs hook.
But did you encounter a use case where the existing API didn't fit the bill? Regards Antoine.

On Mon, Jun 3, 2013 at 11:43 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 3 Jun 2013 09:01:43 -0400 Daniel Holth <dholth@gmail.com> wrote:
The ijson API yields a stream of events containing the full path to each item in the parsed JSON, an event name like "start_map", "end_map", "start_array", ...
list(ijson.parse(StringIO.StringIO("""{ "a": { "b": "c" } }""")))
[('', 'start_map', None), ('', 'map_key', 'a'), ('a', 'start_map', None), ('a', 'map_key', 'b'), ('a.b', 'string', u'c'), ('a', 'end_map', None), ('', 'end_map', None)]
It also has a higher-level API yielding only the objects under a certain prefix. Pass "a.b" and you would get only "c".
Besides memory this kind of thing makes it much easier to know which level of the JSON structure you are in compared to the existing object_pairs hook.
But did you encounter a use case where the existing API didn't fit the bill?
Sometimes it's nice to have a stream-based API; when your memory is very small, your JSON is very large, or your JSON may be very large, or the JSON is being streamed to you little by little and you want to parse part of it, or just don't want to wait for that closing }. It's just a different way to parse than the current all-at-once option.
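For the narrower case where the input is a sequence of separate top-level JSON values arriving in pieces, the existing `json.JSONDecoder.raw_decode` method already allows a rough form of incremental handling. The sketch below is only an illustration of that case, not the pull or event API discussed in this thread; `iter_json_values` and the sample chunks are hypothetical.

```python
import json

def iter_json_values(chunks):
    """Yield each complete top-level JSON value as soon as it has fully arrived."""
    decoder = json.JSONDecoder()
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while True:
            # raw_decode does not skip leading whitespace, so strip it first.
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                value, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # value is incomplete so far; wait for more data
            yield value
            buffer = buffer[end:]

# Simulated network chunks that split values at arbitrary points.
chunks = ['{"a": 1} {"b"', ': [1, 2', ']}', ' {"c": null}']
for value in iter_json_values(chunks):
    print(value)
```

This has clear limits: each value is still parsed all at once, an incomplete buffer is re-scanned from the start on every new chunk, malformed input is indistinguishable from incomplete input, and a bare top-level number split across chunks could be yielded as two numbers. It does not help with a single huge object whose closing } has not arrived yet.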

On 03.06.13 17:58, Daniel Holth wrote:
On Mon, Jun 3, 2013 at 11:43 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 3 Jun 2013 09:01:43 -0400 Daniel Holth <dholth@gmail.com> wrote:
The ijson API yields a stream of events containing the full path to each item in the parsed JSON, an event name like "start_map", "end_map", "start_array", ...
list(ijson.parse(StringIO.StringIO("""{ "a": { "b": "c" } }""")))
[('', 'start_map', None), ('', 'map_key', 'a'), ('a', 'start_map', None), ('a', 'map_key', 'b'), ('a.b', 'string', u'c'), ('a', 'end_map', None), ('', 'end_map', None)]
It also has a higher-level API yielding only the objects under a certain prefix. Pass "a.b" and you would get only "c".
Besides memory this kind of thing makes it much easier to know which level of the JSON structure you are in compared to the existing object_pairs hook.
But did you encounter a use case where the existing API didn't fit the bill?
Sometimes it's nice to have a stream-based API; when your memory is very small, your JSON is very large, or your JSON may be very large, or the JSON is being streamed to you little by little and you want to parse part of it, or just don't want to wait for that closing }. It's just a different way to parse than the current all-at-once option.
I share this experience but would like to add: The JSON RFC (like standard python dicts) makes no ordering assumption on keys, but OTOH serialized data must be somehow streamable, i.e. even with keys predefined per application protocol and meaningfully ordered. Per RFC you have to collect all keys of a certain level. Upon encountering the matching closing curly brace you may finally inspect them, where in real life(tm) you often look for an early out (ingredients a) missing or b) all mandatory present) to accelerate node processing. So I guess the usefulness of streaming solutions based on JSON really also depends on the additional artifact that serializer and deserializer must share an out-of-band convention. Stefan.

On 2013-06-03 13:37, Daniel Holth wrote:
On Mon, Jun 3, 2013 at 5:12 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
On Sun, 2 Jun 2013 22:11:23 -0400 Daniel Holth <dholth@gmail.com> wrote:
When will the stdlib get a decent iterator/stream-based JSON API? For example automated packaging tools may be parsing a lot of JSON but ignoring most of it, and it would be lovely to say "if key not in interesting: skip_without_parsing()".
No offense meant. The existing JSON API is quite good.
Do you think that would make a significant difference? I'm not sure what "parsing a lot of JSON" means, but I suppose packaging metadata is usually quite small.
I don't know whether it would matter for packaging but it would be very useful sometimes.
The jsonpull API looks like: http://code.google.com/p/jsonpull/source/browse/trunk/Example.java
A bit like:
    parser = Json(text)
    parser.eat('{')  # expect an object
    for element in parser.objectElements():
        parser.eat(Json.KEY)
        key = parser.getString()
        if key == "name":
            name = parser.getStringValue()
        elif key == "contact":
You can ask it what the next token is, seek ahead (never behind) to a named key in an object, or iterate over all the keys in an object without necessarily iterating over child objects. Once you get to an interesting sub-object you can get an iterator for that sub-object and perhaps pass it to a child constructor.
Unless your JSON file has dict values of, say, megabytes in size, I doubt that writing such code is going to be much more efficient than just building the whole dict and ignoring the keys that you don't want. I suspect that most of the use cases could be satisfied by being able to either whitelist or blacklist top-level keys. This would be a relatively simple modification to _json.c, I think, if you wanted to pursue it.
-- Robert Kern
"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
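The "build the whole dict and ignore the keys you don't want" baseline described above can be sketched in a few lines of pure Python. This is not the suggested _json.c modification, only the approach it would be compared against; `load_top_level` is a hypothetical helper name.

```python
import json

def load_top_level(text, wanted):
    """Parse the whole document, then keep only the requested top-level keys."""
    document = json.loads(text)
    if not isinstance(document, dict):
        raise TypeError("expected a JSON object at the top level")
    return {key: document[key] for key in wanted if key in document}

text = '{"name": "demo", "files": [1, 2, 3], "contact": {"email": "x@example.org"}}'
print(load_top_level(text, ["name", "contact"]))
```

Unlike an object hook, the filter here applies only to the outermost object, so nested objects under a wanted key are kept intact.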
participants (6)
- Andrew Barnert
- Antoine Pitrou
- Daniel Holth
- Raymond Hettinger
- Robert Kern
- Stefan Drees