Decoding a huge JSON file incrementally
p.f.moore at gmail.com
Thu Dec 20 12:30:30 EST 2018
(Sorry, hit "Send" too soon on the last try!)
On Thu, 20 Dec 2018 at 17:22, Chris Angelico <rosuav at gmail.com> wrote:
> On Fri, Dec 21, 2018 at 2:44 AM Paul Moore <p.f.moore at gmail.com> wrote:
> > I'm looking for a way to incrementally decode a JSON file. I know this
> > has come up before, and in general the problem is not soluble (because
> > in theory the JSON file could be a single object). In my particular
> > situation, though, I have a 9GB file containing a top-level array
> > object, with many elements. So what I could (in theory) do is to parse
> > an element at a time, yielding them.
> > The problem is that the stdlib JSON library reads the whole file,
> > which defeats my purpose. What I'd like is for it to read one
> > complete element, then just far enough ahead to find out that the
> > parse was done, and return the object it found (it should probably
> > also return the "next token", as it can't reliably push it back - I'd
> > check that it was a comma before proceeding with the next list
> > element).
> It IS possible to do an incremental parse, but for that to work, you
> would need to manually strip off the top-level array structure. What
> you'd need to use is json.JSONDecoder.raw_decode:
> It'll parse stuff and then tell you about what's left. Since your data
> isn't coming from a ginormous string, but is coming from a file,
> you're probably going to need something like this:
> def get_stuff_from_file(f):
>     buffer = ""
>     dec = json.JSONDecoder()
>     while "not eof":
>         while "no object yet":
>             try: obj, pos = dec.raw_decode(buffer)
>             except json.JSONDecodeError: buffer += f.read(1024)
>             else: break
>         yield obj
>         buffer = buffer[pos:].lstrip().lstrip(",")
Ah, right. I'd found that function, but as it took input from a string
rather than a file-like object, I'd dismissed it. I didn't think of
decoding partial reads. That's a nice trick, thanks!
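To make the trick concrete, here's a tiny standalone check (my own example, not from the thread) of what raw_decode actually hands back:

```python
import json

# raw_decode returns the decoded object plus the index where parsing
# stopped, leaving the remainder of the string for the caller.
dec = json.JSONDecoder()
obj, pos = dec.raw_decode('{"a": 1}, {"b": 2}] trailing text')
print(obj, pos)  # {'a': 1} 8 -- pos points just past the first object
```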
> Proper error handling is left as an exercise for the reader, both in
> terms of JSON errors and file errors. Also, the code is completely
> untested. Have fun :)
Yeah, once you have the insight that you can attempt to parse a block
at a time, the rest is just a "simple matter of programming" :-)
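For the record, here's my own fuller sketch of that "simple matter of programming" (untested against anything 9GB-sized), with the EOF and array-bracket handling filled in. It assumes the file holds a single top-level JSON array whose opening "[" fits in the first chunk:

```python
import io
import json

def iter_json_array(f, chunk_size=1024):
    dec = json.JSONDecoder()
    buffer = f.read(chunk_size).lstrip()
    if buffer.startswith("["):
        buffer = buffer[1:]  # drop the top-level array's opening bracket
    while True:
        # raw_decode chokes on leading whitespace, so strip separators
        # and whitespace before each attempt.
        buffer = buffer.lstrip().lstrip(",").lstrip()
        if buffer.startswith("]"):
            return  # closing bracket of the top-level array: we're done
        try:
            obj, pos = dec.raw_decode(buffer)
        except json.JSONDecodeError:
            chunk = f.read(chunk_size)
            if not chunk:
                return  # EOF with no further complete element
            buffer += chunk
        else:
            yield obj
            buffer = buffer[pos:]

# Exercising it with a small in-memory "file":
data = '[{"a": 1}, [2, 3], "x", 4]'
elements = list(iter_json_array(io.StringIO(data), chunk_size=4))
print(elements)  # [{'a': 1}, [2, 3], 'x', 4]
```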
> The basic idea is that you keep on grabbing more data till you can
> decode an object, then you keep whatever didn't get used up ("pos"
> points to whatever didn't get consumed). Algorithmic complexity should
> be O(n) as long as your objects are relatively small, and you can
> optimize disk access by tuning your buffer size to be at least the
> average size of an object.
Got it, thanks.
> Hope that helps.
Yes it does, a lot. Much appreciated.