<div class="gmail_quote">On Fri, Jun 15, 2012 at 3:06 PM, Nam Nguyen <span dir="ltr"><<a href="mailto:bitsink@gmail.com" target="_blank">bitsink@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
If I recall correctly, CPython's memory management rarely returns memory<br>
to the OS. Once it has allocated an arena, it tends to hold on to it, and<br>
the space freed inside those arenas is reused for objects that CPython<br>
allocates later.<br>
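(A rough way to check this claim on Linux or macOS, with an arbitrary 500,000-element allocation of my own choosing: the peak RSS jumps for the first allocation but barely moves for the second, because the space freed in between is reused by CPython whether or not it was handed back to the OS.)<br>

```python
import resource

def peak_rss():
    # Peak resident set size so far: kilobytes on Linux, bytes on macOS.
    # The units cancel out below because we only compare ratios.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

baseline = peak_rss()
data = [str(i) for i in range(500000)]   # allocate a few tens of MB of objects
after_first = peak_rss()
del data                                 # refcounts hit zero, the space is freed...
data = [str(i) for i in range(500000)]   # ...and this allocation reuses it
after_second = peak_rss()
del data

# The peak grew for the first allocation but barely moves for the second:
# the freed space was available again to CPython.
grew = after_first > baseline
reused = after_second <= after_first * 1.25
```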
<span class="HOEnZb"><font color="#888888">Nam<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
On Fri, Jun 15, 2012 at 2:44 PM, David Lawrence <<a href="mailto:david@bitcasa.com">david@bitcasa.com</a>> wrote:<br>
> On Fri, Jun 15, 2012 at 2:41 PM, Bob Ippolito <<a href="mailto:bob@redivi.com">bob@redivi.com</a>> wrote:<br>
>><br>
>> On Fri, Jun 15, 2012 at 5:32 PM, David Lawrence <<a href="mailto:david@bitcasa.com">david@bitcasa.com</a>> wrote:<br>
>>><br>
>>> On Fri, Jun 15, 2012 at 2:22 PM, Bob Ippolito <<a href="mailto:bob@redivi.com">bob@redivi.com</a>> wrote:<br>
>>>><br>
>>>> On Fri, Jun 15, 2012 at 4:15 PM, David Lawrence <<a href="mailto:david@bitcasa.com">david@bitcasa.com</a>><br>
>>>> wrote:<br>
>>>>><br>
>>>>> When I load the file with json, Python's memory usage spikes to about<br>
>>>>> 1.8GB, and I can't seem to get that memory released. I put together a<br>
>>>>> very simple test case:<br>
>>>>><br>
>>>>> import json<br>
>>>>><br>
>>>>> with open("test_file.json", 'r') as f:<br>
>>>>>     j = json.load(f)<br>
>>>>><br>
>>>>> I'm sorry that I can't provide a sample json file; my test file has a<br>
>>>>> lot of sensitive information. For context, I'm dealing with a file on<br>
>>>>> the order of 240MB. After running the snippet above, I have the<br>
>>>>> previously mentioned 1.8GB of memory in use. If I then do "del j", memory<br>
>>>>> usage doesn't drop at all. If I follow that with a "gc.collect()" it still<br>
>>>>> doesn't drop. I even tried unloading the json module and running another<br>
>>>>> gc.collect.<br>
>>>>><br>
>>>>> I'm trying to run some memory profiling, but heapy has been churning<br>
>>>>> at 100% CPU for about an hour now and has yet to produce any output.<br>
>>>>><br>
>>>>> Does anyone have any ideas? I've also tried the above using cjson<br>
>>>>> rather than the packaged json module. cjson used about 30% less memory but<br>
>>>>> otherwise displayed exactly the same issues.<br>
>>>>><br>
>>>>> I'm running Python 2.7.2 on Ubuntu server 11.10.<br>
>>>>><br>
>>>>> I'm happy to load up any memory profiler and see if it does better than<br>
>>>>> heapy and provide any diagnostics you might think are necessary. I'm<br>
>>>>> hunting around for a large test json file that I can provide for anyone else<br>
>>>>> to give it a go.<br>
>>>><br>
>>>><br>
>>>> It may just be the way that the allocator works. What happens if you<br>
>>>> load the JSON, del the object, then do it again? Does it take up 3.6GB or<br>
>>>> stay at 1.8GB? You may not be able to "release" that memory to the OS in<br>
>>>> such a way that RSS gets smaller... but at the same time it's not really a<br>
>>>> leak either.<br>
>>>><br>
>>>> GC shouldn't really come into play for a JSON structure, since it's guaranteed<br>
>>>> to be acyclic… ref counting alone should be sufficient to instantly reclaim<br>
>>>> that space. I'm not at all surprised that gc.collect() doesn't change<br>
>>>> anything for CPython in this case.<br>
>>>><br>
>>>> $ python<br>
>>>> Python 2.7.2 (default, Jan 23 2012, 14:26:16)<br>
>>>> [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on<br>
>>>> darwin<br>
>>>> Type "help", "copyright", "credits" or "license" for more information.<br>
>>>> >>> import os, subprocess, simplejson<br>
>>>> >>> def rss(): return subprocess.Popen(['ps', '-o', 'rss', '-p', str(os.getpid())], stdout=subprocess.PIPE).communicate()[0].splitlines()[1].strip()<br>
>>>> ...<br>
>>>> >>> rss()<br>
>>>> '7284'<br>
>>>> >>> l = simplejson.loads(simplejson.dumps([x for x in xrange(1000000)]))<br>
>>>> >>> rss()<br>
>>>> '49032'<br>
>>>> >>> del l<br>
>>>> >>> rss()<br>
>>>> '42232'<br>
>>>> >>> l = simplejson.loads(simplejson.dumps([x for x in xrange(1000000)]))<br>
>>>> >>> rss()<br>
>>>> '49032'<br>
>>>> >>> del l<br>
>>>> >>> rss()<br>
>>>> '42232'<br>
>>>><br>
>>>> $ python<br>
>>>> Python 2.7.2 (default, Jan 23 2012, 14:26:16)<br>
>>>> [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on<br>
>>>> darwin<br>
>>>> Type "help", "copyright", "credits" or "license" for more information.<br>
>>>> >>> import os, subprocess, simplejson<br>
>>>> >>> def rss(): return subprocess.Popen(['ps', '-o', 'rss', '-p', str(os.getpid())], stdout=subprocess.PIPE).communicate()[0].splitlines()[1].strip()<br>
>>>> ...<br>
>>>> >>> l = simplejson.loads(simplejson.dumps(dict((str(x), x) for x in xrange(1000000))))<br>
>>>> >>> rss()<br>
>>>> '288116'<br>
>>>> >>> del l<br>
>>>> >>> rss()<br>
>>>> '84384'<br>
>>>> >>> l = simplejson.loads(simplejson.dumps(dict((str(x), x) for x in xrange(1000000))))<br>
>>>> >>> rss()<br>
>>>> '288116'<br>
>>>> >>> del l<br>
>>>> >>> rss()<br>
>>>> '84384'<br>
>>>><br>
>>>> -bob<br>
>>>><br>
>>><br>
>>> It does appear that after deleting the object and running the example again,<br>
>>> memory stays static at about 1.8GB. Could you provide a little more detail<br>
>>> on what your examples are meant to demonstrate? One shows a static memory<br>
>>> footprint and the other shows the footprint fluctuating up and down. I<br>
>>> would expect the static footprint in the first example just from my<br>
>>> understanding of Python's free lists for integers.<br>
>>><br>
>><br>
>> Both examples show the same thing, but with different data structures<br>
>> (list of int, dict of str:int). The only thing missing is that I left out<br>
>> the baseline in the second example; it would be the same as in the first<br>
>> example.<br>
>><br>
>> The other suggestions are spot on. If you want the memory to really be<br>
>> released, you have to do it in a transient subprocess, and/or you could<br>
>> probably have lower overhead if you're using a streaming parser (if there's<br>
>> something you can do with it incrementally).<br>
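(For the streaming idea: the stdlib alone can get partway there when the top level is a JSON array, using json.JSONDecoder.raw_decode to pull elements out one at a time. This is only a rough sketch, and it still holds the raw text in memory, which a real streaming parser would also avoid.)<br>

```python
import json

def iter_array_items(text):
    # Incrementally decode a JSON document whose top level is an array,
    # yielding one element at a time via raw_decode instead of building
    # the whole structure at once.
    decoder = json.JSONDecoder()
    idx = text.index('[') + 1
    while True:
        # Skip whitespace and the commas between elements.
        while idx < len(text) and text[idx] in ' \t\r\n,':
            idx += 1
        if idx >= len(text) or text[idx] == ']':
            return
        obj, idx = decoder.raw_decode(text, idx)
        yield obj
```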
>><br>
>> -bob<br>
>><br>
><br>
> Thank you all for the help. Multiprocessing with a Queue and blocking get()<br>
> calls looks like it will work well.<br>
><br>
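(That approach might look roughly like this. The file path and the reduction step, here just len() as a stand-in for whatever the parent actually needs, are mine: only the reduced value is pickled back through the Queue, while everything else the parser allocated dies with the child process.)<br>

```python
import json
import multiprocessing

def parse_worker(path, queue):
    # Runs in a child process: every byte the parser allocates is
    # returned to the OS when this process exits.
    with open(path) as f:
        data = json.load(f)
    # Reduce the parsed structure to just what the parent needs;
    # only this value is pickled back through the queue.
    queue.put(len(data))

def parse_in_subprocess(path):
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=parse_worker, args=(path, queue))
    proc.start()
    result = queue.get()   # blocking get(), as suggested in the thread
    proc.join()
    return result
```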
</div></div><div class="HOEnZb"><div class="h5">> _______________________________________________<br>
> Baypiggies mailing list<br>
> <a href="mailto:Baypiggies@python.org">Baypiggies@python.org</a><br>
> To change your subscription options or unsubscribe:<br>
> <a href="http://mail.python.org/mailman/listinfo/baypiggies" target="_blank">http://mail.python.org/mailman/listinfo/baypiggies</a><br>
</div></div></blockquote></div><br><div><br></div><div>Lots of people have raised this idea in my hunt for answers. However, releasing memory to the OS appears to be object dependent. I assume this is because different object types use different memory allocators. Does anyone have a deeper understanding of this?</div>
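<div>(If I remember correctly, Bob's two transcripts hint at the mechanism: in Python 2, freed int objects go onto a type-specific free list that is never returned to the OS, which is why the list-of-ints example barely shrinks on del, while strings and dicts live in pymalloc arenas, and an arena can be released once every object in it is freed. That would be consistent with the dict example dropping from ~288MB to ~84MB. A Linux-only sketch of the dict case, with an arbitrary million-entry size:)</div>

```python
import os

def rss_kb():
    # Linux-only: current resident set size from /proc, in kB.
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])

before = rss_kb()
d = dict((str(x), x) for x in range(1000000))
peak = rss_kb()
del d
after = rss_kb()

# Much of the string/dict memory comes back to the OS: once every object
# in a pymalloc arena is freed, the whole arena can be released.
allocated = peak - before
released = peak - after
```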