Decoding a huge JSON file incrementally
Terry Reedy
tjreedy at udel.edu
Thu Dec 20 13:52:29 EST 2018
On 12/20/2018 10:42 AM, Paul Moore wrote:
> I'm looking for a way to incrementally decode a JSON file. I know this
> has come up before, and in general the problem is not soluble (because
> in theory the JSON file could be a single object).
AFAIK, a JSON file always represents a single JSON item and is
translated (decoded) to a single Python object.
The json encoder has an iterencode method, but the decoder does not have
an iterdecode method. I think it plausibly should have one that would
iterate through a top-level list or dict (JSON array or object).
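For reference, this is roughly what the encoder side looks like. It is only a small illustration of json.JSONEncoder.iterencode, which yields the output text in pieces; the sample value is made up.

```python
import json

encoder = json.JSONEncoder()

# iterencode yields the encoded text piece by piece instead of
# building one big string.
chunks = list(encoder.iterencode([1, {"a": 2}]))

# Joining the pieces gives the same text that json.dumps produces.
joined = "".join(chunks)
```

There is no matching method on json.JSONDecoder, which is the gap discussed here.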
> In my particular
> situation, though, I have a 9GB file containing a top-level array
> object, with many elements. So what I could (in theory) do is to parse
> an element at a time, yielding them.
So your file format, not worrying about whitespace and the possible lack
of ',' after the last item, is '[' (item ',')* ']'. You want to skip
over the '[' instead of creating an empty list, then yield each item
rather than appending to the list.
> The problem is that the stdlib JSON library reads the whole file,
> which defeats my purpose. What I'd like is if it would read one
> complete element, then just enough far ahead to find out that the
> parse was done, and return the object it found (it should probably
> also return the "next token", as it can't reliably push it back - I'd
> check that it was a comma before proceeding with the next list
> element).
I looked at json.decoder and json.scanner. After reading the whole file
into a string, json decodes the string an item at a time with a
scan_once(string, index) function that finds the end of the first item
that starts at index. It then returns the decoded item and the index of
where to continue scanning for the next item. If the string does not
hold a complete representation of an item at that index,
json.decoder.JSONDecodeError is raised (scan_once itself reports "no
value starts here" with StopIteration, which the decoder converts to
JSONDecodeError).
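A small demonstration of that behaviour, using JSONDecoder.raw_decode, which wraps scan_once and returns the same (object, end index) pair. The start-index argument is in the method's signature but is not prominently documented, and the sample text is invented for the example.

```python
import json

decoder = json.JSONDecoder()
text = '{"a": 1}, [2, 3], "tail'

# Decode the item that starts at index 0; `end` is where it stopped.
first, end = decoder.raw_decode(text, 0)

# Step over the ", " separator by hand and decode the next item.
second, end2 = decoder.raw_decode(text, end + 2)

# The last item is cut off mid-string, so decoding it fails.
try:
    decoder.raw_decode(text, end2 + 2)
    truncated = False
except json.JSONDecodeError:
    truncated = True
```

Note that raw_decode does not step over the comma between items; the caller has to do that.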
So I believe you could fairly easily write a function roughly as follows.
    open file and read and check the initial '['
    s = ''; idx = 0
    scan_once = make_scanner(context)
    # I did not figure out what 'context' should be
    while more in file:
        s += large chunk
        try:
            ob, idx = scan_once(s, idx)
            yield ob
        except JSONDecodeError as e:
            check that problem is incompleteness rather than bad format
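One way the outline above might be fleshed out is sketched below. It is not tested against a 9GB file. The name iter_json_array and the chunk_size parameter are made up for this example. It uses JSONDecoder.raw_decode rather than calling make_scanner directly (in CPython's json.decoder, the decoder instance itself is what gets passed to make_scanner as 'context'). It also takes a shortcut on the last step: any decode error is treated as "incomplete" until the file runs out, so a malformed file is only reported after the rest of it has been read. An item is accepted only once the ',' or ']' after it is visible, since a number cut at a chunk boundary (12 of 1234) would otherwise decode without error.

```python
import io
import json


def iter_json_array(fp, chunk_size=65536):
    """Yield the items of a top-level JSON array read from text file fp."""
    decoder = json.JSONDecoder()
    buf, idx, eof = "", 0, False

    def more():
        # Drop the text already consumed and append one more chunk.
        nonlocal buf, idx, eof
        chunk = fp.read(chunk_size)
        eof = not chunk
        buf, idx = buf[idx:] + chunk, 0

    def next_char():
        # Skip whitespace; return the next character, or '' at end of file.
        nonlocal idx
        while True:
            while idx < len(buf) and buf[idx] in " \t\n\r":
                idx += 1
            if idx < len(buf):
                return buf[idx]
            if eof:
                return ""
            more()

    if next_char() != "[":
        raise ValueError("expected a top-level JSON array")
    idx += 1
    first = True
    while True:
        ch = next_char()
        if ch == "]":
            return  # anything after the closing bracket is ignored
        if not first:
            if ch != ",":
                raise ValueError("expected ',' or ']' between items")
            idx += 1
            next_char()
        first = False
        while True:
            try:
                obj, end = decoder.raw_decode(buf, idx)
            except json.JSONDecodeError:
                if eof:
                    raise  # no more input, so the text really is bad
                more()
                continue
            # Accept the item only if its delimiter is already in view.
            j = end
            while j < len(buf) and buf[j] in " \t\n\r":
                j += 1
            if j < len(buf) and buf[j] in ",]":
                idx = end
                yield obj
                break
            if eof:
                raise ValueError("truncated or malformed JSON array")
            more()


# Usage with a tiny chunk size to force items to span chunks.
sample = '[{"a": 1}, 12345, "x,y]", [1, 2], null, 1e5]'
items = list(iter_json_array(io.StringIO(sample), chunk_size=3))
```

An item that spans many chunks is re-decoded from its start after each read, which is simple but wasteful for very large items; a larger chunk_size keeps that rare.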
--
Terry Jan Reedy