[Baypiggies] json using huge memory footprint and not releasing

David Lawrence david at bitcasa.com
Sat Jun 16 00:10:31 CEST 2012


On Fri, Jun 15, 2012 at 3:06 PM, Nam Nguyen <bitsink at gmail.com> wrote:

> If I recall correctly, CPython's memory management does not free memory
> back to the OS. Once it has allocated a slab, it will not release that
> slab; the garbage collector just makes room for new objects within the
> heap space that CPython has already allocated.
> Nam
>
> On Fri, Jun 15, 2012 at 2:44 PM, David Lawrence <david at bitcasa.com> wrote:
> > On Fri, Jun 15, 2012 at 2:41 PM, Bob Ippolito <bob at redivi.com> wrote:
> >>
> >> On Fri, Jun 15, 2012 at 5:32 PM, David Lawrence <david at bitcasa.com>
> >> wrote:
> >>>
> >>> On Fri, Jun 15, 2012 at 2:22 PM, Bob Ippolito <bob at redivi.com> wrote:
> >>>>
> >>>> On Fri, Jun 15, 2012 at 4:15 PM, David Lawrence <david at bitcasa.com>
> >>>> wrote:
> >>>>>
> >>>>> When I load the file with json, Python's memory usage spikes to
> >>>>> about 1.8GB and I can't seem to get that memory to be released.  I
> >>>>> put together a test case that's very simple:
> >>>>>
> >>>>> with open("test_file.json", 'r') as f:
> >>>>>     j = json.load(f)
> >>>>>
> >>>>> I'm sorry that I can't provide a sample json file; my test file has
> >>>>> a lot of sensitive information, but for context, I'm dealing with a
> >>>>> file on the order of 240MB.  After running the above two lines I
> >>>>> have the previously mentioned 1.8GB of memory in use.  If I then do
> >>>>> "del j", memory usage doesn't drop at all.  If I follow that with a
> >>>>> "gc.collect()" it still doesn't drop.  I even tried unloading the
> >>>>> json module and running another gc.collect().
> >>>>>
> >>>>> I'm trying to run some memory profiling, but heapy has been
> >>>>> churning at 100% CPU for about an hour now and has yet to produce
> >>>>> any output.
> >>>>>
> >>>>> Does anyone have any ideas?  I've also tried the above using cjson
> >>>>> rather than the packaged json module.  cjson used about 30% less
> >>>>> memory but otherwise displayed exactly the same issues.
> >>>>>
> >>>>> I'm running Python 2.7.2 on Ubuntu Server 11.10.
> >>>>>
> >>>>> I'm happy to load up any memory profiler and see if it does better
> >>>>> than heapy, and to provide any diagnostics you might think are
> >>>>> necessary.  I'm hunting around for a large test json file that I can
> >>>>> provide for anyone else to give it a go.
> >>>>
> >>>>
> >>>> It may just be the way that the allocator works. What happens if you
> >>>> load the JSON, del the object, then do it again? Does it take up
> >>>> 3.6GB or stay at 1.8GB? You may not be able to "release" that memory
> >>>> to the OS in such a way that RSS gets smaller... but at the same time
> >>>> it's not really a leak either.
> >>>>
> >>>> GC shouldn't really come into play for a JSON structure, since it's
> >>>> guaranteed to be acyclic; ref counting alone should be sufficient to
> >>>> instantly reclaim that space. I'm not at all surprised that
> >>>> gc.collect() doesn't change anything for CPython in this case.
> >>>>
> >>>> $ python
> >>>> Python 2.7.2 (default, Jan 23 2012, 14:26:16)
> >>>> [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
> >>>> Type "help", "copyright", "credits" or "license" for more information.
> >>>> >>> import os, subprocess, simplejson
> >>>> >>> def rss(): return subprocess.Popen(['ps', '-o', 'rss', '-p', str(os.getpid())], stdout=subprocess.PIPE).communicate()[0].splitlines()[1].strip()
> >>>> ...
> >>>> >>> rss()
> >>>> '7284'
> >>>> >>> l = simplejson.loads(simplejson.dumps([x for x in xrange(1000000)]))
> >>>> >>> rss()
> >>>> '49032'
> >>>> >>> del l
> >>>> >>> rss()
> >>>> '42232'
> >>>> >>> l = simplejson.loads(simplejson.dumps([x for x in xrange(1000000)]))
> >>>> >>> rss()
> >>>> '49032'
> >>>> >>> del l
> >>>> >>> rss()
> >>>> '42232'
> >>>>
> >>>> $ python
> >>>> Python 2.7.2 (default, Jan 23 2012, 14:26:16)
> >>>> [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
> >>>> Type "help", "copyright", "credits" or "license" for more information.
> >>>> >>> import os, subprocess, simplejson
> >>>> >>> def rss(): return subprocess.Popen(['ps', '-o', 'rss', '-p', str(os.getpid())], stdout=subprocess.PIPE).communicate()[0].splitlines()[1].strip()
> >>>> ...
> >>>> >>> l = simplejson.loads(simplejson.dumps(dict((str(x), x) for x in xrange(1000000))))
> >>>> >>> rss()
> >>>> '288116'
> >>>> >>> del l
> >>>> >>> rss()
> >>>> '84384'
> >>>> >>> l = simplejson.loads(simplejson.dumps(dict((str(x), x) for x in xrange(1000000))))
> >>>> >>> rss()
> >>>> '288116'
> >>>> >>> del l
> >>>> >>> rss()
> >>>> '84384'
> >>>>
> >>>> -bob
> >>>>
> >>>
> >>> It does appear that if I delete the object and run the example again,
> >>> memory stays static at about 1.8GB.  Could you provide a little more
> >>> detail on what your examples are meant to demonstrate?  One shows a
> >>> static memory footprint and the other shows the footprint fluctuating
> >>> up and down.  I would expect the static footprint in the first example
> >>> just from my understanding of Python's free lists of integers.
> >>>
> >>
> >> Both examples show the same thing, but with different data structures
> >> (list of int, dict of str:int). The only thing missing is that I left
> >> out the baseline in the second example; it would be the same as in the
> >> first example.
> >>
> >> The other suggestions are spot on. If you want the memory to really be
> >> released, you have to do it in a transient subprocess, and/or you could
> >> probably have lower overhead if you're using a streaming parser (if
> >> there's something you can do with it incrementally).
> >>
> >> -bob
> >>
> >
> > Thank you all for the help.  Multiprocessing with a Queue and blocking
> > get() calls looks like it will work well.
> >
>


Lots of people have raised this idea in my hunt for answers.  However,
whether memory is released to the OS appears to depend on the type of
object involved.  I assume this is because different types use different
memory allocators.  Does anyone have a deeper understanding of this?
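
If my understanding is right, something like this sketch should show the
difference (a variant of Bob's rss() helper; Python 2.7 on Linux or Mac,
and the exact numbers and the 200MB size are arbitrary and will vary by
platform and libc):

import os, subprocess

def rss():
    # Resident set size in KB, via ps (same trick as Bob's example).
    out = subprocess.Popen(['ps', '-o', 'rss', '-p', str(os.getpid())],
                           stdout=subprocess.PIPE).communicate()[0]
    return int(out.splitlines()[1].strip())

base = rss()

# Many small objects: these come out of pymalloc arenas and per-type
# free lists, which CPython tends to keep around for reuse, so RSS
# often does not drop after the del.
small = [str(x) for x in xrange(1000000)]
grew = rss() - base
del small
kept = rss() - base
print 'small objects: grew %dK, still holding %dK after del' % (grew, kept)

# One large object: allocations above pymalloc's small-object threshold
# go straight to malloc(), which (e.g. glibc, via mmap) can hand the
# pages back to the OS on free, so RSS usually does drop here.
big = 'x' * (200 * 1024 * 1024)
grew = rss() - base
del big
kept = rss() - base
print 'large object: grew %dK, still holding %dK after del' % (grew, kept)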
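
And for the record, here's roughly the shape of the subprocess approach
I'm planning (a sketch only; the len() reduction is a placeholder for
whatever actually needs to come back to the parent):

import json
from multiprocessing import Process, Queue

def load_and_reduce(path, queue):
    # Runs in a child process; everything allocated while parsing is
    # returned to the OS when the process exits.
    with open(path, 'r') as f:
        j = json.load(f)
    # Put only the (much smaller) piece you actually need on the queue;
    # large objects would be pickled and copied back into the parent.
    queue.put(len(j))  # placeholder reduction

if __name__ == '__main__':
    queue = Queue()
    p = Process(target=load_and_reduce, args=('test_file.json', queue))
    p.start()
    result = queue.get()  # blocking get(), as mentioned above
    p.join()
    print result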