[Baypiggies] json using huge memory footprint and not releasing

Nam Nguyen bitsink at gmail.com
Sat Jun 16 00:06:55 CEST 2012


If I recall correctly, CPython's memory management does not return memory
to the operating system. Once it has allocated an arena, it will not
release that arena; garbage collection just makes room for new CPython
objects within the heap space that CPython has already claimed.
Nam
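
For illustration, a rough way to watch this happen on Linux (a sketch only;
it reads /proc/self/status, which is Linux-specific and not something
mentioned in the thread, and the exact numbers will vary):

def rss_kb():
    # Resident set size of the current process, in kB (Linux only).
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

print rss_kb()                 # baseline
data = [{"n": i} for i in xrange(1000000)]
print rss_kb()                 # up by hundreds of MB
del data
print rss_kb()                 # drops some, but stays well above the baseline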

On Fri, Jun 15, 2012 at 2:44 PM, David Lawrence <david at bitcasa.com> wrote:
> On Fri, Jun 15, 2012 at 2:41 PM, Bob Ippolito <bob at redivi.com> wrote:
>>
>> On Fri, Jun 15, 2012 at 5:32 PM, David Lawrence <david at bitcasa.com> wrote:
>>>
>>> On Fri, Jun 15, 2012 at 2:22 PM, Bob Ippolito <bob at redivi.com> wrote:
>>>>
>>>> On Fri, Jun 15, 2012 at 4:15 PM, David Lawrence <david at bitcasa.com>
>>>> wrote:
>>>>>
>>>>> When I load the file with the json module, Python's memory usage spikes to
>>>>> about 1.8GB and I can't seem to get that memory released.  I put together a
>>>>> very simple test case:
>>>>>
>>>>> import json
>>>>>
>>>>> with open("test_file.json", 'r') as f:
>>>>>     j = json.load(f)
>>>>>
>>>>> I'm sorry that I can't provide a sample JSON file (my test file contains a
>>>>> lot of sensitive information), but for context, I'm dealing with a file on
>>>>> the order of 240MB.  After running the above two lines I have the
>>>>> previously mentioned 1.8GB of memory in use.  If I then do "del j", memory
>>>>> usage doesn't drop at all.  If I follow that with a "gc.collect()" it still
>>>>> doesn't drop.  I even tried unloading the json module and running another
>>>>> gc.collect().
>>>>>
>>>>> I'm trying to run some memory profiling, but heapy has been churning at
>>>>> 100% CPU for about an hour now and has yet to produce any output.
>>>>>
>>>>> Does anyone have any ideas?  I've also tried the above using cjson
>>>>> rather than the packaged json module.  cjson used about 30% less memory but
>>>>> otherwise displayed exactly the same issues.
>>>>>
>>>>> I'm running Python 2.7.2 on Ubuntu server 11.10.
>>>>>
>>>>> I'm happy to load up any memory profiler and see if it does better than
>>>>> heapy and provide any diagnostics you might think are necessary.  I'm
>>>>> hunting around for a large test json file that I can provide for anyone else
>>>>> to give it a go.
>>>>
>>>>
>>>> It may just be the way that the allocator works. What happens if you
>>>> load the JSON, del the object, then do it again? Does it take up 3.6GB or
>>>> stay at 1.8GB? You may not be able to "release" that memory to the OS in
>>>> such a way that RSS gets smaller... but at the same time it's not really a
>>>> leak either.
>>>>
>>>> The cyclic GC shouldn't really come into play for a JSON structure, since
>>>> it's guaranteed to be acyclic… reference counting alone should be enough
>>>> to reclaim that space immediately. I'm not at all surprised that
>>>> gc.collect() doesn't change anything for CPython in this case.
>>>>
>>>> $ python
>>>> Python 2.7.2 (default, Jan 23 2012, 14:26:16)
>>>> [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on
>>>> darwin
>>>> Type "help", "copyright", "credits" or "license" for more information.
>>>> >>> import os, subprocess, simplejson
>>>> >>> def rss(): return subprocess.Popen(['ps', '-o', 'rss', '-p', str(os.getpid())], stdout=subprocess.PIPE).communicate()[0].splitlines()[1].strip()
>>>> ...
>>>> >>> rss()
>>>> '7284'
>>>> >>> l = simplejson.loads(simplejson.dumps([x for x in xrange(1000000)]))
>>>> >>> rss()
>>>> '49032'
>>>> >>> del l
>>>> >>> rss()
>>>> '42232'
>>>> >>> l = simplejson.loads(simplejson.dumps([x for x in xrange(1000000)]))
>>>> >>> rss()
>>>> '49032'
>>>> >>> del l
>>>> >>> rss()
>>>> '42232'
>>>>
>>>> $ python
>>>> Python 2.7.2 (default, Jan 23 2012, 14:26:16)
>>>> [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on
>>>> darwin
>>>> Type "help", "copyright", "credits" or "license" for more information.
>>>> >>> import os, subprocess, simplejson
>>>> >>> def rss(): return subprocess.Popen(['ps', '-o', 'rss', '-p', str(os.getpid())], stdout=subprocess.PIPE).communicate()[0].splitlines()[1].strip()
>>>> ...
>>>> >>> l = simplejson.loads(simplejson.dumps(dict((str(x), x) for x in xrange(1000000))))
>>>> >>> rss()
>>>> '288116'
>>>> >>> del l
>>>> >>> rss()
>>>> '84384'
>>>> >>> l = simplejson.loads(simplejson.dumps(dict((str(x), x) for x in xrange(1000000))))
>>>> >>> rss()
>>>> '288116'
>>>> >>> del l
>>>> >>> rss()
>>>> '84384'
>>>>
>>>> -bob
>>>>
>>>
>>> It does appear that after deleting the object and running the example again,
>>> memory stays static at about 1.8GB.  Could you provide a little more detail
>>> on what your examples are meant to demonstrate?  One shows a static memory
>>> footprint and the other shows the footprint fluctuating up and down.  I
>>> would expect the static footprint in the first example just from my
>>> understanding of Python's free lists for integers.
>>>
>>
>> Both examples show the same thing, but with different data structures
>> (list of int, dict of str:int). The only thing missing is that I left out
>> the baseline in the second example; it would be the same as in the first
>> example.
>>
>> The other suggestions are spot on. If you want the memory to actually be
>> released back to the OS, you have to do the work in a transient subprocess,
>> and/or you could probably keep the overhead lower by using a streaming
>> parser (if there's something you can do with the data incrementally).
>>
>> -bob
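
For reference, the streaming route mentioned above might look roughly like
the sketch below. It assumes the third-party ijson package and a top-level
JSON array, neither of which is given in the thread, and the per-record work
is just a placeholder:

import ijson

count = 0
with open("test_file.json", "rb") as f:
    # ijson yields one element of the top-level array at a time, so the
    # whole 240MB document never has to be held in memory at once.
    for record in ijson.items(f, "item"):
        count += 1    # placeholder for whatever incremental work is needed
print count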
>>
>
> Thank you all for the help.  Multiprocessing with a Queue and blocking get()
> calls looks like it will work well.
>
> _______________________________________________
> Baypiggies mailing list
> Baypiggies at python.org
> To change your subscription options or unsubscribe:
> http://mail.python.org/mailman/listinfo/baypiggies
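
A minimal sketch of that multiprocessing/Queue approach, reusing the file
name from the original test case; the summarising step (len(data)) is only a
placeholder. The point is that the 1.8GB spent parsing lives in the worker
process and is handed back to the OS when that process exits:

import json
import multiprocessing

def load_and_summarize(path, queue):
    # Runs in a separate process; its memory is returned to the OS on exit.
    with open(path, "r") as f:
        data = json.load(f)
    queue.put(len(data))    # send back only the small result, not the parsed structure

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    worker = multiprocessing.Process(target=load_and_summarize,
                                     args=("test_file.json", queue))
    worker.start()
    result = queue.get()    # blocking get(), as described above
    worker.join()
    print result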

