Decoding a huge JSON file incrementally

Chris Angelico rosuav at gmail.com
Thu Dec 20 12:19:17 EST 2018


On Fri, Dec 21, 2018 at 2:44 AM Paul Moore <p.f.moore at gmail.com> wrote:
>
> I'm looking for a way to incrementally decode a JSON file. I know this
> has come up before, and in general the problem is not soluble (because
> in theory the JSON file could be a single object). In my particular
> situation, though, I have a 9GB file containing a top-level array
> object, with many elements. So what I could (in theory) do is to parse
> an element at a time, yielding them.
>
> The problem is that the stdlib JSON library reads the whole file,
> which defeats my purpose. What I'd like is if it would read one
> complete element, then read just far enough ahead to find out that the
> parse was done, and return the object it found (it should probably
> also return the "next token", as it can't reliably push it back - I'd
> check that it was a comma before proceeding with the next list
> element).

It IS possible to do an incremental parse, but for that to work, you
would need to manually strip off the top-level array structure. What
you'd need to use would be this:

https://docs.python.org/3/library/json.html#json.JSONDecoder.raw_decode
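As a quick illustration (a minimal sketch, not from the original post):
raw_decode hands back both the decoded object and the index in the string
where decoding stopped, without complaining about trailing junk:

```python
import json

dec = json.JSONDecoder()
# raw_decode parses one value and tells you where it stopped looking.
obj, pos = dec.raw_decode('{"a": 1} trailing junk')
# obj is the parsed dict; pos is the index just past the closing brace.
print(obj, pos)
```

Note that raw_decode does NOT skip leading whitespace, so whatever you
feed it has to start right at the value.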

It'll parse stuff and then tell you about what's left. Since your data
isn't coming from a ginormous string, but is coming from a file,
you're probably going to need something like this:

import json

def get_stuff_from_file(f):
    buffer = ""
    dec = json.JSONDecoder()
    while "not eof":
        while "no object yet":
            try: obj, pos = dec.raw_decode(buffer)
            except json.JSONDecodeError:
                chunk = f.read(1024)
                if not chunk: return  # EOF: nothing more to parse
                buffer += chunk
            else: break
        yield obj
        # Drop the consumed object, then any whitespace, the separating
        # comma, and any whitespace after it (raw_decode won't skip it).
        buffer = buffer[pos:].lstrip().lstrip(",").lstrip()

Proper error handling is left as an exercise for the reader, both in
terms of JSON errors and file errors. Also, the code is completely
untested. Have fun :)

The basic idea is that you keep on grabbing more data till you can
decode an object, then you keep whatever didn't get used up ("pos"
points to whatever didn't get consumed). Algorithmic complexity should
be O(n) as long as your objects are relatively small, and you can
optimize disk access by tuning your buffer size to be at least the
average size of an object.
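To make that concrete, here's a self-contained variant of the generator
with a toy "file" standing in for the 9GB one (io.StringIO is just for
the demo; a real file object works the same way). Note the caller has to
strip off the opening "[" of the top-level array by hand first:

```python
import io
import json

def get_stuff_from_file(f):
    # Same idea as the sketch above, repeated so this demo runs on its own.
    buffer = ""
    dec = json.JSONDecoder()
    while "not eof":
        while "no object yet":
            try: obj, pos = dec.raw_decode(buffer)
            except json.JSONDecodeError:
                chunk = f.read(1024)
                if not chunk: return  # EOF: stop iterating
                buffer += chunk
            else: break
        yield obj
        # Skip whitespace, the separating comma, and more whitespace.
        buffer = buffer[pos:].lstrip().lstrip(",").lstrip()

f = io.StringIO('[{"id": 1}, {"id": 2}, [3, 4], "five"]')
f.read(1)  # manually consume the top-level array's opening "["
for obj in get_stuff_from_file(f):
    print(obj)  # prints each element as it is decoded
```

The trailing "]" simply fails to decode as a value, at which point the
generator hits EOF and stops, so it falls out naturally.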

Hope that helps.

ChrisA

