Decoding a huge JSON file incrementally

Terry Reedy tjreedy at udel.edu
Thu Dec 20 13:52:29 EST 2018


On 12/20/2018 10:42 AM, Paul Moore wrote:
> I'm looking for a way to incrementally decode a JSON file. I know this
> has come up before, and in general the problem is not soluble (because
> in theory the JSON file could be a single object).

AFAIK, a JSON file always represents a single JSON item and is 
translated (decoded) to a single Python object.

The json encoder has an iterencode method, but the decoder does not have 
an iterdecode method.  I think it plausibly should have one that would 
iterate through a top-level list or dict (JSON array or object).
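
To illustrate the asymmetry, here is a small sketch using only the documented stdlib API. The sample data is made up, and the hasattr check only reflects the json module as it currently ships.

```python
import json

sample = [1, {"a": 2}]  # made-up data for illustration

# The encoder can hand back its output in pieces ...
chunks = list(json.JSONEncoder().iterencode(sample))
encoded = "".join(chunks)

# ... but the decoder class offers no matching incremental method.
has_iterdecode = hasattr(json.JSONDecoder, "iterdecode")
```

Joining the chunks gives the same text as json.dumps(sample); how the text is split into chunks is an implementation detail and should not be relied on.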

> In my particular
> situation, though, I have a 9GB file containing a top-level array
> object, with many elements. So what I could (in theory) do is to parse
> an element at a time, yielding them.

So your file format, not worrying about whitespace and the possible lack 
of ',' after the last item, is '[' (item ',')* ']'.  You want to skip 
over the '[' instead of creating an empty list, then yield each item 
rather than appending to the list.

> The problem is that the stdlib JSON library reads the whole file,
> which defeats my purpose. What I'd like is if it would read one
> complete element, then just enough far ahead to find out that the
> parse was done, and return the object it found (it should probably
> also return the "next token", as it can't reliably push it back - I'd
> check that it was a comma before proceeding with the next list
> element).

I looked at json.decoder and json.scanner.  After reading the whole file 
into a string, json decodes the string an item at a time with a 
scan_once(string, index) function that finds the end of the first item 
in the string, starting at index.  It then returns the decoded item and 
the index of where to continue scanning for the next item. If the string 
does not contain a complete representation of an item at that index, 
json.decoder.JSONDecodeError is raised (or StopIteration, when the 
scanner finds nothing that could start an item there).
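
A small sketch of that item-at-a-time behaviour, using the documented JSONDecoder.raw_decode(s, idx) wrapper around scan_once (it turns the scanner's StopIteration into JSONDecodeError, so there is only one exception to catch). The sample text is made up, and the caller is assumed to skip the separators between items.

```python
import json

decoder = json.JSONDecoder()
text = '{"a": 1}, [2, 3], "tail'  # made-up text; the last item is cut off

# Decode the item starting at index 0; 'end1' is the index just past it.
first, end1 = decoder.raw_decode(text, 0)

# Step over the ', ' separator by hand, then decode the next item.
second, end2 = decoder.raw_decode(text, end1 + 2)

# The third item is incomplete, so decoding it fails.
try:
    decoder.raw_decode(text, end2 + 2)
    third_failed = False
except json.JSONDecodeError:
    third_failed = True
```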

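So I believe you could fairly easily write a generator function roughly 
as follows.
   open file and read and check the initial '['
   s = ''; idx = 0
   decoder = json.JSONDecoder()
   # make_scanner's 'context' is a JSONDecoder instance, and
   # decoder.scan_once is already the result of make_scanner(decoder)
   while more in file:
     s += large chunk
     while True:
       skip whitespace and ',' in s from idx; stop at the final ']'
       try:
         ob, idx = decoder.scan_once(s, idx)
       except (StopIteration, JSONDecodeError) as e:
         check that problem is incompleteness rather than bad format
         break  # read another chunk, then retry from the same idx
       yield ob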
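
For concreteness, here is one way the outline could be turned into runnable code. This is only a sketch under several assumptions: the names iter_json_array and chunk_size are made up, the file is opened in text mode, raw_decode is used in place of calling scan_once directly, and any decode failure is treated as "need more data", so a genuinely malformed file is only reported once the end of the file is reached. An item is accepted only when it is followed by ',' or ']', so that a number split across two chunks is not yielded early.

```python
import io
import json

WS = " \t\n\r"  # JSON whitespace


def iter_json_array(fp, chunk_size=65536):
    """Yield the elements of a top-level JSON array from text file fp."""
    decoder = json.JSONDecoder()
    buf = ""

    def read_more():
        nonlocal buf
        chunk = fp.read(chunk_size)
        if not chunk:
            raise ValueError("truncated or malformed JSON array")
        buf += chunk

    def skip_ws():
        # Make buf start with a non-whitespace character.
        nonlocal buf
        buf = buf.lstrip(WS)
        while not buf:
            read_more()
            buf = buf.lstrip(WS)

    skip_ws()
    if buf[0] != "[":
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    skip_ws()
    if buf[0] == "]":
        return  # empty array

    while True:
        try:
            obj, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            read_more()  # assume the item is merely incomplete
            continue
        rest = buf[end:].lstrip(WS)
        if not rest or rest[0] not in ",]":
            read_more()  # cannot yet tell whether the item has ended
            continue
        yield obj
        if rest[0] == "]":
            return
        buf = rest[1:]  # drop the consumed item and its ','
        skip_ws()


# Usage with an in-memory file and a deliberately tiny chunk size.
demo = io.StringIO('[1, {"a": [2, 3]}, "x,]", 12.5, true, null]')
items = list(iter_json_array(demo, chunk_size=3))
```

Only the undecoded tail of the file is kept in buf, so memory use is bounded by the largest single element plus one chunk. A failed attempt re-parses the current element from its start, which is cheap as long as elements are small relative to chunk_size.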


-- 
Terry Jan Reedy


