Here are the total and max times in millions of units; 30000 units is approximately 13 seconds. I have extracted the runs where there are many gc-collect-steps. These are in execution order, so the first runs with many gc-collect-steps aren't slow. *Totals*: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 *Max*: gc-minor:10 gc-collect-step:247 *Totals*: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 *Max*: gc-minor:10 gc-collect-step:245 *Totals*: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 *Max*: gc-minor:11 gc-collect-step:244 *Totals*: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 *Max*: gc-minor:17 gc-collect-step:244 *Totals*: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 *Max*: gc-minor:11 gc-collect-step:248 *Totals*: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 *Max*: gc-minor:8 gc-collect-step:299 *Totals*: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 *Max*: gc-minor:11 gc-collect-step:246 *Totals*: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 *Max*: gc-minor:8 gc-collect-step:244 *Totals*: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 *Max*: gc-minor:36 gc-collect-step:248 *Totals*: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 *Max*: gc-minor:8 gc-collect-step:244 *Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 *Max*: gc-minor:8 gc-collect-step:245 *Totals*: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 *Max*: gc-minor:8 gc-collect-step:244 *Totals*: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 *Max*: gc-minor:38 gc-collect-step:244 *Totals*: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 *Max*: gc-minor:23 gc-collect-step:245 *Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 *Max*: gc-minor:8 gc-collect-step:246 *Totals*: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 *Max*: gc-minor:9 gc-collect-step:244 *Totals*: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 *Max*: gc-minor:8 gc-collect-step:246 *Totals*: gc-minor:388 slow:1 gc-minor-walkroots:0 gc-collect-step:31179 *Max*: gc-minor:8 gc-collect-step:248 *Totals*: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 *Max*: gc-minor:8 gc-collect-step:250 *Totals*: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 *Max*: gc-minor:8 gc-collect-step:245 *Totals*: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 *Max*: gc-minor:543 gc-collect-step:244 *Totals*: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 *Max*: gc-minor:20 gc-collect-step:246 *Totals*: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 *Max*: gc-minor:25 gc-collect-step:245 Thanks, /Martin On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch <mak@issuu.com> wrote:
Ah. I had misunderstood. I'll get back to you on that :) thanks
/Martin
On 17/03/2014, at 15.21, Maciej Fijalkowski <fijall@gmail.com> wrote:
eh, this is not what I need
I need a max of TIME it took for a gc-minor and the TOTAL time it took for a gc-minor (per query) (ideally same for gc-walkroots and gc-collect-step)
On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch <mak@issuu.com> wrote: Here are the collated results of running each query. For each run, I count how many of each of the pypy debug lines i get. I.e. there were 668 runs that printed 58 loglines that contain "{gc-minor" which was eventually followed by "gc-minor}". I have also counted if the query was slow; interestingly, not all the queries with many gc-minors were slow (but all slow queries had a gc-minor).
Please let me know if this is unclear :)
668 gc-minor:58 gc-minor-walkroots:58 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 140 gc-minor:59 gc-minor-walkroots:59 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 gc-collect-step:9589 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 gc-collect-step:9590 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 gc-collect-step:9609 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 1 jit-log-compiling-loop:1 gc-collect-step:8991 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030 jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 jit-resume:14 1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84 2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 gc-minor:61 jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:104
Thanks, /Martin
On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski <fijall@gmail.com> wrote:
On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski <fijall@gmail.com> wrote:
are you *sure* it's the walkroots that take that long and not something else (like gc-minor)? More of those mean that you allocate a lot more surviving objects. Can you do two things:
a) take a max of gc-minor (and gc-minor-stackwalk), per request b) take the sum of those
and plot them
^^^ or just paste the results actually
On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch <mak@issuu.com> wrote: Well, then it works out to around 2.5GHz, which seems reasonable. But it doesn't alter the conclusion from the previous email: The slow
queries
then all have a duration around 34*10^9 units, 'normal' queries 1*10^9 units, or .4 seconds at this conversion. Also, the log shows that a slow query performs many more gc-minor operations than a 'normal' one: 9600 gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58.
So the question becomes: Why do we get this large spike in gc-minor-walkroots, and, in particular, is there any way to avoid it :) ?
Thanks, /Martin
On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski < fijall@gmail.com> wrote: > > I think it's the cycles of your CPU > >> On Mon, Mar 17, 2014 at 2:48 PM, Martin Koch <mak@issuu.com> wrote: >> What is the unit? Perhaps I'm being thick here, but I can't >> correlate it >> with seconds (which the program does print out). Slow runs are >> around 13 >> seconds, but are around 34*10^9(dec), 0x800000000 timestamp units >> (e.g. >> from >> 0x2b994c9d31889c to 0x2b9944ab8c4f49). >> >> >> >> On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski >> <fijall@gmail.com> >> wrote: >>> >>> The number of lines is nonsense. This is a timestamp in hex. >>> >>> On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch <mak@issuu.com> >>> wrote: >>>> Based On Maciej's suggestion, I tried the following >>>> >>>> PYPYLOG=- pypy mem.py 10000000 > out >>>> >>>> This generates a logfile which looks something like this >>>> >>>> start--> >>>> [2b99f1981b527e] {gc-minor >>>> [2b99f1981ba680] {gc-minor-walkroots >>>> [2b99f1981c2e02] gc-minor-walkroots} >>>> [2b99f19890d750] gc-minor} >>>> [snip] >>>> ... >>>> <--stop >>>> >>>> >>>> It turns out that the culprit is a lot of MINOR collections. >>>> >>>> I base this on the following observations: >>>> >>>> I can't understand the format of the timestamp on each logline >>>> (the >>>> "[2b99f1981b527e]"). From what I can see in the code, this should >>>> be >>>> output >>>> from time.clock(), but that doesn't return a number like that >>>> when I >>>> run >>>> pypy interactively >>>> Instead, I count the number of debug lines between start--> and >>>> the >>>> corresponding <--stop. >>>> Most runs have a few hundred lines of output between start/stop >>>> All slow runs have very close to 57800 lines out output between >>>> start/stop >>>> One such sample does 9609 gc-collect-step operations, 9647 >>>> gc-minor >>>> operations, and 9647 gc-minor-walkroots operations. >>>> >>>> >>>> Thanks, >>>> /Martin >>>> >>>> >>>> On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski >>>> <fijall@gmail.com> >>>> wrote: >>>>> >>>>> there is an environment variable PYPYLOG=gc:- (where - is >>>>> stdout) >>>>> which will do that for you btw. >>>>> >>>>> maybe you can find out what's that using profiling or valgrind? >>>>> >>>>> On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch <mak@issuu.com> >>>>> wrote: >>>>>> I have tried getting the pypy source and building my own >>>>>> version >>>>>> of >>>>>> pypy. I >>>>>> have modified >>>>>> rpython/memory/gc/incminimark.py:major_collection_step() >>>>>> to >>>>>> print out when it starts and when it stops. Apparently, the >>>>>> slow >>>>>> queries >>>>>> do >>>>>> NOT occur during major_collection_step; at least, I have not >>>>>> observed >>>>>> major >>>>>> step output during a query execution. So, apparently, >>>>>> something >>>>>> else >>>>>> is >>>>>> blocking. This could be another aspect of the GC, but it could >>>>>> also >>>>>> be >>>>>> anything else. >>>>>> >>>>>> Just to be sure, I have tried running the same application in >>>>>> python >>>>>> with >>>>>> garbage collection disabled. I don't see the problem there, so >>>>>> it >>>>>> is >>>>>> somehow >>>>>> related to either GC or the runtime somehow. >>>>>> >>>>>> Cheers, >>>>>> /Martin >>>>>> >>>>>> >>>>>> On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch <mak@issuu.com> >>>>>> wrote: >>>>>>> >>>>>>> We have hacked up a small sample that seems to exhibit the >>>>>>> same >>>>>>> issue. >>>>>>> >>>>>>> We basically generate a linked list of objects. To increase >>>>>>> connectedness, >>>>>>> elements in the list hold references (dummy_links) to 10 >>>>>>> randomly >>>>>>> chosen >>>>>>> previous elements in the list. >>>>>>> >>>>>>> We then time a function that traverses 50000 elements from >>>>>>> the >>>>>>> list >>>>>>> from a >>>>>>> random start point. If the traversal reaches the end of the >>>>>>> list, >>>>>>> we >>>>>>> instead >>>>>>> traverse one of the dummy links. Thus, exactly 50K elements >>>>>>> are >>>>>>> traversed >>>>>>> every time. To generate some garbage, we build a list holding >>>>>>> the >>>>>>> traversed >>>>>>> elements and a dummy list of characters. >>>>>>> >>>>>>> Timings for the last 100 runs are stored in a circular >>>>>>> buffer. If >>>>>>> the >>>>>>> elapsed time for the last run is more than twice the average >>>>>>> time, >>>>>>> we >>>>>>> print >>>>>>> out a line with the elapsed time, the threshold, and the 90% >>>>>>> runtime >>>>>>> (we >>>>>>> would like to see that the mean runtime does not increase >>>>>>> with >>>>>>> the >>>>>>> number of >>>>>>> elements in the list, but that the max time does increase >>>>>>> (linearly >>>>>>> with the >>>>>>> number of object, i guess); traversing 50K elements should be >>>>>>> independent of >>>>>>> the memory size). >>>>>>> >>>>>>> We have tried monitoring memory consumption by external >>>>>>> inspection, >>>>>>> but >>>>>>> cannot consistently verify that memory is deallocated at the >>>>>>> same >>>>>>> time >>>>>>> that >>>>>>> we see slow requests. Perhaps the pypy runtime doesn't always >>>>>>> return >>>>>>> freed >>>>>>> pages back to the OS? >>>>>>> >>>>>>> Using top, we observe that 10M elements allocates around 17GB >>>>>>> after >>>>>>> building, 20M elements 26GB, 30M elements 28GB (and grows to >>>>>>> 35GB >>>>>>> shortly >>>>>>> after building). >>>>>>> >>>>>>> Here is output from a few runs with different number of >>>>>>> elements: >>>>>>> >>>>>>> >>>>>>> pypy mem.py 10000000 >>>>>>> start build >>>>>>> end build 84.142424 >>>>>>> that took a long time elapsed: 13.230586 slow_threshold: >>>>>>> 1.495401 >>>>>>> 90th_quantile_runtime: 0.421558 >>>>>>> that took a long time elapsed: 13.016531 slow_threshold: >>>>>>> 1.488160 >>>>>>> 90th_quantile_runtime: 0.423441 >>>>>>> that took a long time elapsed: 13.032537 slow_threshold: >>>>>>> 1.474563 >>>>>>> 90th_quantile_runtime: 0.419817 >>>>>>> >>>>>>> pypy mem.py 20000000 >>>>>>> start build >>>>>>> end build 180.823105 >>>>>>> that took a long time elapsed: 27.346064 slow_threshold: >>>>>>> 2.295146 >>>>>>> 90th_quantile_runtime: 0.434726 >>>>>>> that took a long time elapsed: 26.028852 slow_threshold: >>>>>>> 2.283927 >>>>>>> 90th_quantile_runtime: 0.374190 >>>>>>> that took a long time elapsed: 25.432279 slow_threshold: >>>>>>> 2.279631 >>>>>>> 90th_quantile_runtime: 0.371502 >>>>>>> >>>>>>> pypy mem.py 30000000 >>>>>>> start build >>>>>>> end build 276.217811 >>>>>>> that took a long time elapsed: 40.993855 slow_threshold: >>>>>>> 3.188464 >>>>>>> 90th_quantile_runtime: 0.459891 >>>>>>> that took a long time elapsed: 41.693553 slow_threshold: >>>>>>> 3.183003 >>>>>>> 90th_quantile_runtime: 0.393654 >>>>>>> that took a long time elapsed: 39.679769 slow_threshold: >>>>>>> 3.190782 >>>>>>> 90th_quantile_runtime: 0.393677 >>>>>>> that took a long time elapsed: 43.573411 slow_threshold: >>>>>>> 3.239637 >>>>>>> 90th_quantile_runtime: 0.393654 >>>>>>> >>>>>>> Code below >>>>>>> >>>>>>> -------------------------------------------------------------- >>>>>>> import time >>>>>>> from random import randint, choice >>>>>>> import sys >>>>>>> >>>>>>> >>>>>>> allElems = {} >>>>>>> >>>>>>> class Node: >>>>>>> def __init__(self, v_): >>>>>>> self.v = v_ >>>>>>> self.next = None >>>>>>> self.dummy_data = [randint(0,100) >>>>>>> for _ in xrange(randint(50,100))] >>>>>>> allElems[self.v] = self >>>>>>> if self.v > 0: >>>>>>> self.dummy_links = [allElems[randint(0, >>>>>>> self.v-1)] >>>>>>> for _ >>>>>>> in >>>>>>> xrange(10)] >>>>>>> else: >>>>>>> self.dummy_links = [self] >>>>>>> >>>>>>> def set_next(self, l): >>>>>>> self.next = l >>>>>>> >>>>>>> >>>>>>> def follow(node): >>>>>>> acc = [] >>>>>>> count = 0 >>>>>>> cur = node >>>>>>> assert node.v is not None >>>>>>> assert cur is not None >>>>>>> while count < 50000: >>>>>>> # return a value; generate some garbage >>>>>>> acc.append((cur.v, >>>>>>> [choice("abcdefghijklmnopqrstuvwxyz") >>>>>>> for >>>>>>> x >>>>>>> in >>>>>>> xrange(100)])) >>>>>>> >>>>>>> # if we have reached the end, chose a random link >>>>>>> cur = choice(cur.dummy_links) if cur.next is None >>>>>>> else >>>>>>> cur.next >>>>>>> count += 1 >>>>>>> >>>>>>> return acc >>>>>>> >>>>>>> >>>>>>> def build(num_elems): >>>>>>> start = time.time() >>>>>>> print "start build" >>>>>>> root = Node(0) >>>>>>> cur = root >>>>>>> for x in xrange(1, num_elems): >>>>>>> e = Node(x) >>>>>>> cur.next = e >>>>>>> cur = e >>>>>>> print "end build %f" % (time.time() - start) >>>>>>> return root >>>>>>> >>>>>>> >>>>>>> num_timings = 100 >>>>>>> if __name__ == "__main__": >>>>>>> num_elems = int(sys.argv[1]) >>>>>>> build(num_elems) >>>>>>> total = 0 >>>>>>> timings = [0.0] * num_timings # run times for the last >>>>>>> num_timings >>>>>>> runs >>>>>>> i = 0 >>>>>>> beginning = time.time() >>>>>>> while time.time() - beginning < 600: >>>>>>> start = time.time() >>>>>>> elem = allElems[randint(0, num_elems - 1)] >>>>>>> assert(elem is not None) >>>>>>> >>>>>>> lst = follow(elem) >>>>>>> >>>>>>> total += choice(lst)[0] # use the return value for >>>>>>> something >>>>>>> >>>>>>> end = time.time() >>>>>>> >>>>>>> elapsed = end-start >>>>>>> timings[i % num_timings] = elapsed >>>>>>> if (i > num_timings): >>>>>>> slow_time = 2 * sum(timings)/num_timings # slow >>>>>>> defined >>>>>>> as >>>>>>>> >>>>>>> 2*avg run time >>>>>>> if (elapsed > slow_time): >>>>>>> print "that took a long time elapsed: %f >>>>>>> slow_threshold: >>>>>>> %f 90th_quantile_runtime: %f" % \ >>>>>>> (elapsed, slow_time, >>>>>>> sorted(timings)[int(num_timings*.9)]) >>>>>>> i += 1 >>>>>>> print total >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Thu, Mar 13, 2014 at 7:45 PM, Maciej Fijalkowski >>>>>>> <fijall@gmail.com> >>>>>>> wrote: >>>>>>>> >>>>>>>> On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch <mak@issuu.com> >>>>>>>> wrote: >>>>>>>>> Hi Armin, Maciej >>>>>>>>> >>>>>>>>> Thanks for responding. >>>>>>>>> >>>>>>>>> I'm in the process of trying to determine what (if any) of >>>>>>>>> the >>>>>>>>> code >>>>>>>>> I'm >>>>>>>>> in a >>>>>>>>> position to share, and I'll get back to you. >>>>>>>>> >>>>>>>>> Allowing hinting to the GC would be good. Even better >>>>>>>>> would be >>>>>>>>> a >>>>>>>>> means >>>>>>>>> to >>>>>>>>> allow me to (transparently) allocate objects in unmanaged >>>>>>>>> memory, >>>>>>>>> but I >>>>>>>>> would expect that to be a tall order :) >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> /Martin >>>>>>>> >>>>>>>> Hi Martin. >>>>>>>> >>>>>>>> Note that in case you want us to do the work of isolating >>>>>>>> the >>>>>>>> problem, >>>>>>>> we do offer paid support to do that (then we can sign NDAs >>>>>>>> and >>>>>>>> stuff). >>>>>>>> Otherwise we would be more than happy to fix bugs once you >>>>>>>> isolate >>>>>>>> a >>>>>>>> part you can share freely :) >>>>>>> >>>>>>> >>>>>> >>>> >>>> >> >>