[pypy-dev] Pypy garbage collection

Maciej Fijalkowski fijall at gmail.com
Mon Mar 17 15:21:38 CET 2014


eh, this is not what I need

I need the max TIME it took for a gc-minor and the TOTAL time it took
for gc-minor, per query (ideally the same for gc-minor-walkroots and
gc-collect-step).
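
Something along these lines could pull that out of a PYPYLOG dump (a
minimal sketch, assuming the "[hexstamp] {section" / "[hexstamp] section}"
lines and the start-->/<--stop markers from Martin's log excerpt further
down the thread; durations are left in raw timestamp units):

    import sys
    from collections import defaultdict

    def gc_times(logfile):
        opened = {}                  # section name -> timestamp it opened at
        max_t = defaultdict(int)     # section name -> max duration this query
        total_t = defaultdict(int)   # section name -> total duration this query
        for line in open(logfile):
            line = line.strip()
            if line == 'start-->':
                max_t.clear(); total_t.clear()
            elif line == '<--stop':
                for name in sorted(total_t):
                    print name, 'max:', max_t[name], 'total:', total_t[name]
            elif line.startswith('[') and '] ' in line:
                stamp, rest = line[1:].split('] ', 1)
                stamp = int(stamp, 16)
                if rest.startswith('{'):
                    opened[rest[1:]] = stamp      # section opens here
                elif rest.endswith('}'):
                    name = rest[:-1]              # section closes here
                    if name in opened:
                        d = stamp - opened.pop(name)
                        max_t[name] = max(max_t[name], d)
                        total_t[name] += d

    if __name__ == '__main__':
        gc_times(sys.argv[1])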

On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch <mak at issuu.com> wrote:
> Here are the collated results of running each query. For each run, I count
> how many of each of the pypy debug lines I get. E.g., there were 668 runs
> that printed 58 log lines containing "{gc-minor", each eventually followed
> by "gc-minor}". I have also noted whether the query was slow;
> interestingly, not all the queries with many gc-minors were slow (but all
> slow queries had many gc-minors).
>
> Please let me know if this is unclear :)
>
>     668 gc-minor:58 gc-minor-walkroots:58
>      10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5
>     140 gc-minor:59 gc-minor-walkroots:59
>       1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403
>       1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249
>       9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 gc-collect-step:9589
>       1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 gc-collect-step:9590
>      10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 gc-collect-step:9609
>       1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614
>       1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58
>       1 jit-log-compiling-loop:1 gc-collect-step:8991 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030 jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
>       1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 jit-resume:14
>       1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
>       2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
>       1 jit-log-short-preamble:2 jit-log-compiling-loop:2 jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 gc-minor:61 jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:104
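>
> (Roughly how this collation can be reproduced; a minimal sketch that
> counts the "{section" lines between each start-->/<--stop pair, not
> necessarily the exact script used:)
>
>     from collections import Counter
>
>     def collate(logfile):
>         per_run = Counter()   # section name -> count within the current run
>         runs = Counter()      # per-run signature -> number of such runs
>         for line in open(logfile):
>             line = line.strip()
>             if line == 'start-->':
>                 per_run.clear()
>             elif line == '<--stop':
>                 sig = ' '.join('%s:%d' % kv for kv in sorted(per_run.items()))
>                 runs[sig] += 1
>             elif '] {' in line:
>                 per_run[line.split('] {', 1)[1]] += 1
>         for sig, n in runs.most_common():
>             print '%7d %s' % (n, sig)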
>
>
> Thanks,
> /Martin
>
>
>
> On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski <fijall at gmail.com>
> wrote:
>>
>> On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski <fijall at gmail.com>
>> wrote:
>> > are you *sure* it's the walkroots that take that long and not
>> > something else (like gc-minor)? More of those mean that you allocate a
>> > lot more surviving objects. Can you do two things:
>> >
>> > a) take a max of gc-minor (and gc-minor-walkroots), per request
>> > b) take the sum of those
>> >
>> > and plot them
>>
>> ^^^ or just paste the results actually
>>
>> >
>> > On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch <mak at issuu.com> wrote:
>> >> Well, then it works out to around 2.5GHz, which seems reasonable. But it
>> >> doesn't alter the conclusion from the previous email: the slow queries
>> >> all have a duration of around 34*10^9 units, while 'normal' queries take
>> >> about 1*10^9 units, or 0.4 seconds at this conversion. Also, the log
>> >> shows that a slow query performs many more gc-minor operations than a
>> >> 'normal' one: about 9600 gc-collect-step/gc-minor/gc-minor-walkroots
>> >> operations vs 58.
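>> >>
>> >> (A quick sanity check on that conversion, as a sketch; it assumes the
>> >> stamps are raw cycle counts at ~2.5GHz, using two example stamps from
>> >> the log:)
>> >>
>> >>     t0 = int("2b9944ab8c4f49", 16)
>> >>     t1 = int("2b994c9d31889c", 16)
>> >>     print (t1 - t0) / 2.5e9   # ~34*10^9 cycles -> ~13.6 seconds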
>> >>
>> >> So the question becomes: why do we get this large spike in
>> >> gc-minor-walkroots, and, in particular, is there any way to avoid it :) ?
>> >>
>> >> Thanks,
>> >> /Martin
>> >>
>> >>
>> >> On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski <fijall at gmail.com>
>> >> wrote:
>> >>>
>> >>> I think it's the cycles of your CPU
>> >>>
>> >>> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch <mak at issuu.com> wrote:
>> >>> > What is the unit? Perhaps I'm being thick here, but I can't
>> >>> > correlate it with seconds (which the program does print out). Slow
>> >>> > runs are around 13 seconds, but span around 34*10^9 (decimal), or
>> >>> > about 0x800000000, timestamp units (e.g. from 0x2b9944ab8c4f49 to
>> >>> > 0x2b994c9d31889c).
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski
>> >>> > <fijall at gmail.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> The number of lines is nonsense. This is a timestamp in hex.
>> >>> >>
>> >>> >> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch <mak at issuu.com>
>> >>> >> wrote:
>> >>> >> > Based on Maciej's suggestion, I tried the following:
>> >>> >> >
>> >>> >> > PYPYLOG=- pypy mem.py 10000000 > out
>> >>> >> >
>> >>> >> > This generates a logfile which looks something like this:
>> >>> >> >
>> >>> >> > start-->
>> >>> >> > [2b99f1981b527e] {gc-minor
>> >>> >> > [2b99f1981ba680] {gc-minor-walkroots
>> >>> >> > [2b99f1981c2e02] gc-minor-walkroots}
>> >>> >> > [2b99f19890d750] gc-minor}
>> >>> >> > [snip]
>> >>> >> > ...
>> >>> >> > <--stop
>> >>> >> >
>> >>> >> >
>> >>> >> > It turns out that the culprit is a lot of MINOR collections.
>> >>> >> >
>> >>> >> > I base this on the following observations:
>> >>> >> >
>> >>> >> > - I can't understand the format of the timestamp on each log line
>> >>> >> >   (the "[2b99f1981b527e]"). From what I can see in the code, this
>> >>> >> >   should be output from time.clock(), but that doesn't return a
>> >>> >> >   number like that when I run pypy interactively. Instead, I count
>> >>> >> >   the number of debug lines between start--> and the corresponding
>> >>> >> >   <--stop.
>> >>> >> > - Most runs have a few hundred lines of output between start/stop.
>> >>> >> > - All slow runs have very close to 57800 lines of output between
>> >>> >> >   start/stop.
>> >>> >> > - One such sample does 9609 gc-collect-step operations, 9647
>> >>> >> >   gc-minor operations, and 9647 gc-minor-walkroots operations.
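>> >>> >> >
>> >>> >> > (The line counting itself is simple; as a sketch, with 'out' being
>> >>> >> > the log file from above:)
>> >>> >> >
>> >>> >> >     counts = []
>> >>> >> >     n = 0
>> >>> >> >     for line in open('out'):
>> >>> >> >         if 'start-->' in line:
>> >>> >> >             n = 0
>> >>> >> >         elif '<--stop' in line:
>> >>> >> >             counts.append(n)
>> >>> >> >         else:
>> >>> >> >             n += 1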
>> >>> >> >
>> >>> >> >
>> >>> >> > Thanks,
>> >>> >> > /Martin
>> >>> >> >
>> >>> >> >
>> >>> >> > On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski
>> >>> >> > <fijall at gmail.com>
>> >>> >> > wrote:
>> >>> >> >>
>> >>> >> >> There is an environment variable, PYPYLOG=gc:- (where - means
>> >>> >> >> stdout), which will do that for you, btw.
>> >>> >> >>
>> >>> >> >> Maybe you can find out what that is using profiling or valgrind?
>> >>> >> >>
>> >>> >> >> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch <mak at issuu.com>
>> >>> >> >> wrote:
>> >>> >> >> > I have tried getting the pypy source and building my own version
>> >>> >> >> > of pypy. I have modified
>> >>> >> >> > rpython/memory/gc/incminimark.py:major_collection_step() to print
>> >>> >> >> > out when it starts and when it stops. Apparently, the slow
>> >>> >> >> > queries do NOT occur during major_collection_step; at least, I
>> >>> >> >> > have not observed major step output during a query execution. So,
>> >>> >> >> > apparently, something else is blocking. This could be another
>> >>> >> >> > aspect of the GC, but it could also be anything else.
>> >>> >> >> >
>> >>> >> >> > Just to be sure, I have tried running the same application in
>> >>> >> >> > python with garbage collection disabled. I don't see the problem
>> >>> >> >> > there, so it is somehow related to either the GC or the runtime.
>> >>> >> >> >
>> >>> >> >> > Cheers,
>> >>> >> >> > /Martin
>> >>> >> >> >
>> >>> >> >> >
>> >>> >> >> > On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch <mak at issuu.com>
>> >>> >> >> > wrote:
>> >>> >> >> >>
>> >>> >> >> >> We have hacked up a small sample that seems to exhibit the same
>> >>> >> >> >> issue.
>> >>> >> >> >>
>> >>> >> >> >> We basically generate a linked list of objects. To increase
>> >>> >> >> >> connectedness, elements in the list hold references
>> >>> >> >> >> (dummy_links) to 10 randomly chosen previous elements in the
>> >>> >> >> >> list.
>> >>> >> >> >>
>> >>> >> >> >> We then time a function that traverses 50000 elements of the
>> >>> >> >> >> list, starting from a random point. If the traversal reaches the
>> >>> >> >> >> end of the list, we instead traverse one of the dummy links.
>> >>> >> >> >> Thus, exactly 50K elements are traversed every time. To generate
>> >>> >> >> >> some garbage, we build a list holding the traversed elements and
>> >>> >> >> >> a dummy list of characters.
>> >>> >> >> >>
>> >>> >> >> >> Timings for the last 100 runs are stored in a circular buffer.
>> >>> >> >> >> If the elapsed time for the last run is more than twice the
>> >>> >> >> >> average time, we print out a line with the elapsed time, the
>> >>> >> >> >> threshold, and the 90th percentile runtime (we would like to see
>> >>> >> >> >> that the mean runtime does not increase with the number of
>> >>> >> >> >> elements in the list, but that the max time does increase,
>> >>> >> >> >> linearly with the number of objects, I guess; traversing 50K
>> >>> >> >> >> elements should be independent of the memory size).
>> >>> >> >> >>
>> >>> >> >> >> We have tried monitoring memory consumption by external
>> >>> >> >> >> inspection, but cannot consistently verify that memory is
>> >>> >> >> >> deallocated at the same time that we see slow requests. Perhaps
>> >>> >> >> >> the pypy runtime doesn't always return freed pages back to the
>> >>> >> >> >> OS?
>> >>> >> >> >>
>> >>> >> >> >> Using top, we observe that a run with 10M elements allocates
>> >>> >> >> >> around 17GB after building, 20M elements 26GB, and 30M elements
>> >>> >> >> >> 28GB (which grows to 35GB shortly after building).
>> >>> >> >> >>
>> >>> >> >> >> Here is output from a few runs with different numbers of
>> >>> >> >> >> elements:
>> >>> >> >> >>
>> >>> >> >> >> pypy mem.py 10000000
>> >>> >> >> >> start build
>> >>> >> >> >> end build 84.142424
>> >>> >> >> >> that took a long time elapsed: 13.230586  slow_threshold: 1.495401  90th_quantile_runtime: 0.421558
>> >>> >> >> >> that took a long time elapsed: 13.016531  slow_threshold: 1.488160  90th_quantile_runtime: 0.423441
>> >>> >> >> >> that took a long time elapsed: 13.032537  slow_threshold: 1.474563  90th_quantile_runtime: 0.419817
>> >>> >> >> >>
>> >>> >> >> >> pypy mem.py 20000000
>> >>> >> >> >> start build
>> >>> >> >> >> end build 180.823105
>> >>> >> >> >> that took a long time elapsed: 27.346064  slow_threshold: 2.295146  90th_quantile_runtime: 0.434726
>> >>> >> >> >> that took a long time elapsed: 26.028852  slow_threshold: 2.283927  90th_quantile_runtime: 0.374190
>> >>> >> >> >> that took a long time elapsed: 25.432279  slow_threshold: 2.279631  90th_quantile_runtime: 0.371502
>> >>> >> >> >>
>> >>> >> >> >> pypy mem.py 30000000
>> >>> >> >> >> start build
>> >>> >> >> >> end build 276.217811
>> >>> >> >> >> that took a long time elapsed: 40.993855  slow_threshold: 3.188464  90th_quantile_runtime: 0.459891
>> >>> >> >> >> that took a long time elapsed: 41.693553  slow_threshold: 3.183003  90th_quantile_runtime: 0.393654
>> >>> >> >> >> that took a long time elapsed: 39.679769  slow_threshold: 3.190782  90th_quantile_runtime: 0.393677
>> >>> >> >> >> that took a long time elapsed: 43.573411  slow_threshold: 3.239637  90th_quantile_runtime: 0.393654
>> >>> >> >> >>
>> >>> >> >> >> Code below
>> >>> >> >> >>
>> >>> >> >> >> --------------------------------------------------------------
>> >>> >> >> >> import time
>> >>> >> >> >> from random import randint, choice
>> >>> >> >> >> import sys
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >> allElems = {}
>> >>> >> >> >>
>> >>> >> >> >> class Node:
>> >>> >> >> >>     def __init__(self, v_):
>> >>> >> >> >>         self.v = v_
>> >>> >> >> >>         self.next = None
>> >>> >> >> >>         self.dummy_data = [randint(0,100)
>> >>> >> >> >>                            for _ in xrange(randint(50,100))]
>> >>> >> >> >>         allElems[self.v] = self
>> >>> >> >> >>         if self.v > 0:
>> >>> >> >> >>             self.dummy_links = [allElems[randint(0, self.v-1)]
>> >>> >> >> >>                                 for _ in xrange(10)]
>> >>> >> >> >>         else:
>> >>> >> >> >>             self.dummy_links = [self]
>> >>> >> >> >>
>> >>> >> >> >>     def set_next(self, l):
>> >>> >> >> >>         self.next = l
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >> def follow(node):
>> >>> >> >> >>     acc = []
>> >>> >> >> >>     count = 0
>> >>> >> >> >>     cur = node
>> >>> >> >> >>     assert node.v is not None
>> >>> >> >> >>     assert cur is not None
>> >>> >> >> >>     while count < 50000:
>> >>> >> >> >>         # return a value; generate some garbage
>> >>> >> >> >>         acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz")
>> >>> >> >> >>                             for x in xrange(100)]))
>> >>> >> >> >>
>> >>> >> >> >>         # if we have reached the end, choose a random link
>> >>> >> >> >>         cur = choice(cur.dummy_links) if cur.next is None else cur.next
>> >>> >> >> >>         count += 1
>> >>> >> >> >>
>> >>> >> >> >>     return acc
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >> def build(num_elems):
>> >>> >> >> >>     start = time.time()
>> >>> >> >> >>     print "start build"
>> >>> >> >> >>     root = Node(0)
>> >>> >> >> >>     cur = root
>> >>> >> >> >>     for x in xrange(1, num_elems):
>> >>> >> >> >>         e = Node(x)
>> >>> >> >> >>         cur.next = e
>> >>> >> >> >>         cur = e
>> >>> >> >> >>     print "end build %f" % (time.time() - start)
>> >>> >> >> >>     return root
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >> num_timings = 100
>> >>> >> >> >> if __name__ == "__main__":
>> >>> >> >> >>     num_elems = int(sys.argv[1])
>> >>> >> >> >>     build(num_elems)
>> >>> >> >> >>     total = 0
>> >>> >> >> >>     timings = [0.0] * num_timings # run times for the last num_timings runs
>> >>> >> >> >>     i = 0
>> >>> >> >> >>     beginning = time.time()
>> >>> >> >> >>     while time.time() - beginning < 600:
>> >>> >> >> >>         start = time.time()
>> >>> >> >> >>         elem = allElems[randint(0, num_elems - 1)]
>> >>> >> >> >>         assert(elem is not None)
>> >>> >> >> >>
>> >>> >> >> >>         lst = follow(elem)
>> >>> >> >> >>
>> >>> >> >> >>         total += choice(lst)[0] # use the return value for
>> >>> >> >> >> something
>> >>> >> >> >>
>> >>> >> >> >>         end = time.time()
>> >>> >> >> >>
>> >>> >> >> >>         elapsed = end-start
>> >>> >> >> >>         timings[i % num_timings] = elapsed
>> >>> >> >> >>         if (i > num_timings):
>> >>> >> >> >>             slow_time = 2 * sum(timings)/num_timings # slow defined as > 2*avg run time
>> >>> >> >> >>             if (elapsed > slow_time):
>> >>> >> >> >>                 print "that took a long time elapsed: %f  slow_threshold: %f  90th_quantile_runtime: %f" % \
>> >>> >> >> >>                     (elapsed, slow_time, sorted(timings)[int(num_timings*.9)])
>> >>> >> >> >>         i += 1
>> >>> >> >> >>     print total
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski
>> >>> >> >> >> <fijall at gmail.com>
>> >>> >> >> >> wrote:
>> >>> >> >> >>>
>> >>> >> >> >>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch <mak at issuu.com>
>> >>> >> >> >>> wrote:
>> >>> >> >> >>> > Hi Armin, Maciej
>> >>> >> >> >>> >
>> >>> >> >> >>> > Thanks for responding.
>> >>> >> >> >>> >
>> >>> >> >> >>> > I'm in the process of trying to determine what (if any) of
>> >>> >> >> >>> > the code I'm in a position to share, and I'll get back to
>> >>> >> >> >>> > you.
>> >>> >> >> >>> >
>> >>> >> >> >>> > Allowing hinting to the GC would be good. Even better would
>> >>> >> >> >>> > be a means to allow me to (transparently) allocate objects in
>> >>> >> >> >>> > unmanaged memory, but I would expect that to be a tall
>> >>> >> >> >>> > order :)
>> >>> >> >> >>> >
>> >>> >> >> >>> > Thanks,
>> >>> >> >> >>> > /Martin
>> >>> >> >> >>>
>> >>> >> >> >>> Hi Martin.
>> >>> >> >> >>>
>> >>> >> >> >>> Note that in case you want us to do the work of isolating the
>> >>> >> >> >>> problem, we do offer paid support to do that (then we can sign
>> >>> >> >> >>> NDAs and stuff). Otherwise we would be more than happy to fix
>> >>> >> >> >>> bugs once you isolate a part you can share freely :)
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >
>> >>> >> >
>> >>> >> >
>> >>> >
>> >>> >
>> >>
>> >>
>
>

