[Baypiggies] json using huge memory footprint and not releasing

Bob Ippolito bob at redivi.com
Sat Jun 16 06:29:35 CEST 2012


On Fri, Jun 15, 2012 at 5:41 PM, Bob Ippolito <bob at redivi.com> wrote:

> On Fri, Jun 15, 2012 at 5:32 PM, David Lawrence <david at bitcasa.com> wrote:
>
>> On Fri, Jun 15, 2012 at 2:22 PM, Bob Ippolito <bob at redivi.com> wrote:
>>
>>> On Fri, Jun 15, 2012 at 4:15 PM, David Lawrence <david at bitcasa.com>wrote:
>>>
>>>> When I load the file with json, Python's memory usage spikes to about
>>>> 1.8GB, and I can't seem to get that memory released.  I put together a
>>>> test case that's very simple:
>>>>
>>>> with open("test_file.json", 'r') as f:
>>>>     j = json.load(f)
>>>>
>>>> I'm sorry that I can't provide a sample json file, since my test file
>>>> has a lot of sensitive information, but for context, I'm dealing with a
>>>> file on the order of 240MB.  After running the above two lines I have the
>>>> previously mentioned 1.8GB of memory in use.  If I then do "del j", memory
>>>> usage doesn't drop at all.  If I follow that with a "gc.collect()" it still
>>>> doesn't drop.  I even tried unloading the json module and running another
>>>> gc.collect().
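>>>>
>>>> Concretely, the steps I'm describing above look roughly like this:
>>>>
>>>> import gc
>>>> import json
>>>>
>>>> with open("test_file.json", 'r') as f:
>>>>     j = json.load(f)
>>>> # resident memory is now around 1.8GB
>>>> del j           # no visible drop in memory usage
>>>> gc.collect()    # still no drop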
>>>>
>>>> I'm trying to run some memory profiling, but heapy has been churning
>>>> at 100% CPU for about an hour now and has yet to produce any output.
>>>>
>>>> Does anyone have any ideas?  I've also tried the above using cjson
>>>> rather than the packaged json module.  cjson used about 30% less memory but
>>>> otherwise displayed exactly the same issues.
>>>>
>>>> I'm running Python 2.7.2 on Ubuntu server 11.10.
>>>>
>>>> I'm happy to load up any memory profiler and see if it does better than
>>>> heapy, and to provide any diagnostics you might think are necessary.  I'm
>>>> hunting around for a large test json file that I can provide so anyone
>>>> else can give it a go.
>>>>
>>>
>>> It may just be the way that the allocator works. What happens if you
>>> load the JSON, del the object, then do it again? Does it take up 3.6GB or
>>> stay at 1.8GB? You may not be able to "release" that memory to the OS in
>>> such a way that RSS gets smaller... but at the same time it's not really a
>>> leak either.
>>>
>>> The cyclic GC shouldn't really come into play for a JSON structure, since
>>> it's guaranteed to be acyclic; reference counting alone should be sufficient
>>> to reclaim that space immediately. I'm not at all surprised that gc.collect()
>>> doesn't change anything for CPython in this case.
>>>
>>> $ python
>>> Python 2.7.2 (default, Jan 23 2012, 14:26:16)
>>> [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on
>>> darwin
>>> Type "help", "copyright", "credits" or "license" for more information.
>>> >>> import os, subprocess, simplejson
>>> >>> def rss(): return subprocess.Popen(['ps', '-o', 'rss', '-p', str(os.getpid())], stdout=subprocess.PIPE).communicate()[0].splitlines()[1].strip()
>>> ...
>>> >>> rss()
>>> '7284'
>>> >>> l = simplejson.loads(simplejson.dumps([x for x in xrange(1000000)]))
>>> >>> rss()
>>> '49032'
>>> >>> del l
>>> >>> rss()
>>> '42232'
>>> >>> l = simplejson.loads(simplejson.dumps([x for x in xrange(1000000)]))
>>> >>> rss()
>>> '49032'
>>> >>> del l
>>> >>> rss()
>>> '42232'
>>>
>>> $ python
>>> Python 2.7.2 (default, Jan 23 2012, 14:26:16)
>>> [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on
>>> darwin
>>> Type "help", "copyright", "credits" or "license" for more information.
>>> >>> import os, subprocess, simplejson
>>> >>> def rss(): return subprocess.Popen(['ps', '-o', 'rss', '-p', str(os.getpid())], stdout=subprocess.PIPE).communicate()[0].splitlines()[1].strip()
>>> ...
>>> >>> l = simplejson.loads(simplejson.dumps(dict((str(x), x) for x in xrange(1000000))))
>>> >>> rss()
>>> '288116'
>>> >>> del l
>>> >>> rss()
>>> '84384'
>>> >>> l = simplejson.loads(simplejson.dumps(dict((str(x), x) for x in xrange(1000000))))
>>> >>> rss()
>>> '288116'
>>> >>> del l
>>> >>> rss()
>>> '84384'
>>>
>>> -bob
>>>
>>>
>> It does appear that after deleting the object and running the example again,
>> memory stays static at about 1.8GB.  Could you provide a little more detail
>> on what your examples are meant to demonstrate?  One shows a static memory
>> footprint and the other shows the footprint fluctuating up and down.  I
>> would expect the static footprint in the first example just from my
>> understanding of Python's free lists for integers.
>>
>>
> Both examples show the same thing, but with different data structures
> (list of int, dict of str:int). The only thing missing is that I left out
> the baseline in the second example; it would be the same as in the first
> example.
>
> The other suggestions are spot on. If you want the memory to really be
> released back to the OS, you have to do the work in a transient subprocess,
> and/or you could probably have lower overhead if you use a streaming parser
> (if there's something you can do with the data incrementally).
>
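
To make that streaming suggestion concrete, here's a rough sketch. It assumes
a third-party event-based JSON library such as ijson is available, and that
the top level of your file is a JSON array (the 'item' prefix and the
process() callback are placeholders you'd adapt to your structure):

import ijson  # hypothetical choice of streaming parser, just one option

with open("test_file.json", 'rb') as f:
    # yields one element of the top-level array at a time instead of
    # building the whole multi-gigabyte structure in memory at once
    for record in ijson.items(f, 'item'):
        process(record)  # placeholder for whatever you do per record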

It does actually look like gc.collect() does some work in this case,
despite the data structures being acyclic, because the GC will clear some
of the freelists. It's not the sort of full compaction that you would
expect, but it does help for certain kinds of objects. You're still better
off with subprocesses for doing short work with a big transient data
structure.
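
A minimal sketch of the subprocess approach, assuming the work on the parsed
data can be boiled down to a function that returns a small result rather than
the whole structure (the names here are just illustrative):

import json
import multiprocessing

def summarize(path):
    # Runs in a child process: the ~1.8GB parsed structure lives and dies here.
    with open(path, 'r') as f:
        data = json.load(f)
    return len(data)  # return only a small summary, not the big object

if __name__ == '__main__':
    pool = multiprocessing.Pool(1)
    result = pool.apply(summarize, ('test_file.json',))
    pool.close()
    pool.join()
    # The worker process has exited, so its memory goes back to the OS.
    print result

The parent process never touches the big structure, so its RSS stays small
regardless of what the allocator in the child does.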

-bob