
Hi List

I'm running a server (written in Python, executed with PyPy) that holds a large graph (55GB, millions of nodes and edges) in memory and responds to queries by traversing the graph. The graph is mutated a few times a second, and there are hundreds of read-only requests a second.

My problem is that I have no control over garbage collection. Thus, a major GC might kick in while serving a query, and with this amount of data, the GC takes around 2 minutes. I have tried mitigating this by guessing when a GC might be due, and proactively starting the garbage collector while not serving a request (this is OK, as duplicate servers will respond to requests while this one is collecting).

What I would really like is to be able to disable garbage collection for the old generation. This is because the graph is fairly static, and I can live with leaking memory from the relatively few and small mutations that occur. Any queries are only likely to generate objects in the new generation, and it is fine to collect these. Also, by design, the process is periodically restarted in order to re-synchronize it with an authoritative source (thus rebuilding the graph from scratch), so slight leakage is not an issue here.

I have tried experimenting with setting environment variables <http://pypy.readthedocs.org/en/latest/gc_info.html> as well as the 'gc' module, but nothing seems to give me what I want.

If disabling GC for certain generations is not possible, it would be nice to be able to get a hint when a major collection is about to occur, so I can stop serving requests.

I'm using the following PyPy version: Python 2.7.3 (2.2.1+dfsg-1, Jan 24 2014, 10:12:37) [PyPy 2.2.1 with GCC 4.6.3] on linux2

An additional question: PyPy 2.2.1 should have incremental GC; shouldn't that avoid long pauses due to garbage collection?

Thanks,
/Martin Koch - Senior Systems Architect - issuu.com
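(For illustration only: the "proactively starting the garbage collector" mitigation described above boils down to something like the sketch below. gc.collect() is the standard-library call; the request loop, the get_request/handle_request callables and the idle interval are made-up placeholders, not part of the original server.)

import gc
import time

COLLECT_EVERY = 300.0  # seconds between proactive full collections (made-up value)

def serve_forever(get_request, handle_request):
    last_collect = time.time()
    while True:
        request = get_request(timeout=0.1)  # hypothetical: returns None when idle
        if request is not None:
            handle_request(request)
        elif time.time() - last_collect > COLLECT_EVERY:
            # Idle and "due": run a full collection now, while a duplicate
            # server can take the traffic, rather than letting a major GC
            # kick in in the middle of a query.
            gc.collect()
            last_collect = time.time()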

On Thu, Mar 13, 2014 at 12:06 AM, Martin Koch <mak@issuu.com> wrote:
An additional question: pypy 2.2.1 should have incremental GC; shouldn't that avoid long pauses due to garbage collection?
Yes, it totally should. If your pauses are not incremental, we would like to be able to execute it. Since it's 55G, do you think you can make us an example that can run on a normal machine?

Hi Martin,

On 13 March 2014 00:56, Maciej Fijalkowski <fijall@gmail.com> wrote:
Yes, it totally should. If your pauses are not incremental, we would like to be able to execute it. Since it's 55G, do you think you can make us an example that can run on a normal machine?
I think the request is not very clear. We do have a machine with 100GB of RAM, so that part should not be a problem. The question of Maciej can probably be rephrased as: can you give us a reproducible example? Even if the large pauses appear to occur on any example you try (which they shouldn't), please give us one such example.

Also, maybe we should anyway have a way to give the GC a hint: "now is a good time to run if you need to".

A bientôt,
Armin.

Hi Armin, Maciej

Thanks for responding.

I'm in the process of trying to determine what (if any) of the code I'm in a position to share, and I'll get back to you.

Allowing hinting to the GC would be good. Even better would be a means to allow me to (transparently) allocate objects in unmanaged memory, but I would expect that to be a tall order :)

Thanks,
/Martin

On Thu, Mar 13, 2014 at 1:45 PM, Martin Koch <mak@issuu.com> wrote:
I'm in the process of trying to determine what (if any) of the code I'm in a position to share, and I'll get back to you.
Hi Martin.

Note that in case you want us to do the work of isolating the problem, we do offer paid support to do that (then we can sign NDAs and stuff). Otherwise we would be more than happy to fix bugs once you isolate a part you can share freely :)

We have hacked up a small sample that seems to exhibit the same issue.

We basically generate a linked list of objects. To increase connectedness, elements in the list hold references (dummy_links) to 10 randomly chosen previous elements in the list.

We then time a function that traverses 50000 elements from the list from a random start point. If the traversal reaches the end of the list, we instead traverse one of the dummy links. Thus, exactly 50K elements are traversed every time. To generate some garbage, we build a list holding the traversed elements and a dummy list of characters.

Timings for the last 100 runs are stored in a circular buffer. If the elapsed time for the last run is more than twice the average time, we print out a line with the elapsed time, the threshold, and the 90th percentile runtime. (We would like to see that the mean runtime does not increase with the number of elements in the list, but that the max time does increase, linearly with the number of objects, I guess; traversing 50K elements should be independent of the memory size.)

We have tried monitoring memory consumption by external inspection, but cannot consistently verify that memory is deallocated at the same time that we see slow requests. Perhaps the pypy runtime doesn't always return freed pages back to the OS?

Using top, we observe that 10M elements allocates around 17GB after building, 20M elements 26GB, 30M elements 28GB (and grows to 35GB shortly after building).

Here is output from a few runs with different numbers of elements:

pypy mem.py 10000000
start build
end build 84.142424
that took a long time elapsed: 13.230586 slow_threshold: 1.495401 90th_quantile_runtime: 0.421558
that took a long time elapsed: 13.016531 slow_threshold: 1.488160 90th_quantile_runtime: 0.423441
that took a long time elapsed: 13.032537 slow_threshold: 1.474563 90th_quantile_runtime: 0.419817

pypy mem.py 20000000
start build
end build 180.823105
that took a long time elapsed: 27.346064 slow_threshold: 2.295146 90th_quantile_runtime: 0.434726
that took a long time elapsed: 26.028852 slow_threshold: 2.283927 90th_quantile_runtime: 0.374190
that took a long time elapsed: 25.432279 slow_threshold: 2.279631 90th_quantile_runtime: 0.371502

pypy mem.py 30000000
start build
end build 276.217811
that took a long time elapsed: 40.993855 slow_threshold: 3.188464 90th_quantile_runtime: 0.459891
that took a long time elapsed: 41.693553 slow_threshold: 3.183003 90th_quantile_runtime: 0.393654
that took a long time elapsed: 39.679769 slow_threshold: 3.190782 90th_quantile_runtime: 0.393677
that took a long time elapsed: 43.573411 slow_threshold: 3.239637 90th_quantile_runtime: 0.393654

Code below
--------------------------------------------------------------
import time
from random import randint, choice
import sys


allElems = {}


class Node:
    def __init__(self, v_):
        self.v = v_
        self.next = None
        self.dummy_data = [randint(0, 100)
                           for _ in xrange(randint(50, 100))]
        allElems[self.v] = self
        if self.v > 0:
            self.dummy_links = [allElems[randint(0, self.v - 1)]
                                for _ in xrange(10)]
        else:
            self.dummy_links = [self]

    def set_next(self, l):
        self.next = l


def follow(node):
    acc = []
    count = 0
    cur = node
    assert node.v is not None
    assert cur is not None
    while count < 50000:
        # return a value; generate some garbage
        acc.append((cur.v, [choice("abcdefghijklmnopqrstuvwxyz")
                            for x in xrange(100)]))

        # if we have reached the end, choose a random link
        cur = choice(cur.dummy_links) if cur.next is None else cur.next
        count += 1

    return acc


def build(num_elems):
    start = time.time()
    print "start build"
    root = Node(0)
    cur = root
    for x in xrange(1, num_elems):
        e = Node(x)
        cur.next = e
        cur = e
    print "end build %f" % (time.time() - start)
    return root


num_timings = 100

if __name__ == "__main__":
    num_elems = int(sys.argv[1])
    build(num_elems)
    total = 0
    timings = [0.0] * num_timings  # run times for the last num_timings runs
    i = 0
    beginning = time.time()
    while time.time() - beginning < 600:
        start = time.time()
        elem = allElems[randint(0, num_elems - 1)]
        assert(elem is not None)
        lst = follow(elem)
        total += choice(lst)[0]  # use the return value for something
        end = time.time()
        elapsed = end - start
        timings[i % num_timings] = elapsed
        if (i > num_timings):
            slow_time = 2 * sum(timings) / num_timings  # slow defined as > 2*avg run time
            if (elapsed > slow_time):
                print "that took a long time elapsed: %f slow_threshold: %f 90th_quantile_runtime: %f" % \
                    (elapsed, slow_time, sorted(timings)[int(num_timings * .9)])
        i += 1
    print total
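(For illustration: instead of eyeballing top, the resident set size can also be sampled from inside the process. A minimal Linux-only sketch, reading VmRSS from /proc/self/status; this helper is not part of the sample above.)

def rss_mb():
    """Return the current resident set size in MB (Linux only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0  # the value is reported in kB
    return None

Calling rss_mb() right after the build and again whenever a slow run is logged would show whether the resident size actually drops when pages are freed.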

I have tried getting the pypy source and building my own version of pypy. I have modified rpython/memory/gc/incminimark.py:major_collection_step() to print out when it starts and when it stops. Apparently, the slow queries do NOT occur during major_collection_step; at least, I have not observed major step output during a query execution. So, apparently, something else is blocking. This could be another aspect of the GC, but it could also be anything else.

Just to be sure, I have tried running the same application in Python with garbage collection disabled. I don't see the problem there, so it is somehow related to either the GC or the runtime.

Cheers,
/Martin
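(The CPython comparison amounts to something like the sketch below. It assumes the sample is saved as mem.py as shown earlier; the driver script and its numbers are illustrative, not from the original test. gc.disable() only switches off CPython's cyclic collector, reference counting still reclaims most garbage.)

# cpython_compare.py -- run the same traversal workload under CPython
# with the cyclic garbage collector switched off.
import gc
import time
from random import randint

from mem import build, follow, allElems  # the sample program shown earlier

gc.disable()

NUM_ELEMS = 10000000
build(NUM_ELEMS)

worst = 0.0
for _ in xrange(1000):
    start = time.time()
    follow(allElems[randint(0, NUM_ELEMS - 1)])
    worst = max(worst, time.time() - start)
print "worst traversal time: %f seconds" % worst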

There is an environment variable PYPYLOG=gc:- (where - is stdout) which will do that for you, btw.

Maybe you can find out what that is using profiling or valgrind?

Based on Maciej's suggestion, I tried the following:

PYPYLOG=- pypy mem.py 10000000 > out

This generates a logfile which looks something like this:

start-->
[2b99f1981b527e] {gc-minor
[2b99f1981ba680] {gc-minor-walkroots
[2b99f1981c2e02] gc-minor-walkroots}
[2b99f19890d750] gc-minor}
[snip]
...
<--stop

It turns out that the culprit is a lot of MINOR collections. I base this on the following observations:

- I can't understand the format of the timestamp on each logline (the "[2b99f1981b527e]"). From what I can see in the code, this should be output from time.clock(), but that doesn't return a number like that when I run pypy interactively.
- Instead, I count the number of debug lines between start--> and the corresponding <--stop.
- Most runs have a few hundred lines of output between start/stop.
- All slow runs have very close to 57800 lines of output between start/stop.
- One such sample does 9609 gc-collect-step operations, 9647 gc-minor operations, and 9647 gc-minor-walkroots operations.

Thanks, /Martin
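(The counting described above can be automated with a small script along these lines. This is a sketch: it assumes the start--> / <--stop marker lines printed by the test program end up in the same file as the PYPYLOG output, and it counts opening "{gc-..." debug sections per query.)

import re
import sys
from collections import Counter

def count_gc_sections(path):
    """Count opening gc-* debug sections between each start--> / <--stop pair."""
    per_query = []
    counts = None
    with open(path) as f:
        for line in f:
            if 'start-->' in line:
                counts = Counter()
            elif '<--stop' in line and counts is not None:
                per_query.append(counts)
                counts = None
            elif counts is not None:
                m = re.search(r'\{(gc-[\w-]+)', line)
                if m:
                    counts[m.group(1)] += 1
    return per_query

if __name__ == '__main__':
    for i, c in enumerate(count_gc_sections(sys.argv[1])):
        print i, dict(c)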

The number of lines is nonsense. This is a timestamp in hex.

What is the unit? Perhaps I'm being thick here, but I can't correlate it with seconds (which the program does print out). Slow runs are around 13 seconds, but are around 34*10^9 (dec), 0x800000000 timestamp units (e.g. from 0x2b994c9d31889c to 0x2b9944ab8c4f49).

I think it's the cycles of your CPU.

Well, then it works out to around 2.5GHz, which seems reasonable. But it doesn't alter the conclusion from the previous email: the slow queries then all have a duration around 34*10^9 units, 'normal' queries 1*10^9 units, or 0.4 seconds at this conversion. Also, the log shows that a slow query performs many more gc-minor operations than a 'normal' one: 9600 gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58.

So the question becomes: why do we get this large spike in gc-minor-walkroots, and, in particular, is there any way to avoid it? :)

Thanks,
/Martin
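(For reference, the conversion used here treats the bracketed values as CPU cycle counts at an estimated 2.5 GHz; that frequency is the estimate from this thread, not something read from the log. The second timestamp in the example below is made up to differ from the first by exactly 0x800000000 cycles, the magnitude reported above.)

CPU_HZ = 2.5e9  # estimated clock frequency from correlating log deltas with wall time

def delta_seconds(start_hex, stop_hex, hz=CPU_HZ):
    """Convert two PYPYLOG bracket timestamps (hex cycle counts) to seconds."""
    return (int(stop_hex, 16) - int(start_hex, 16)) / hz

# e.g. a 'slow' query spanning ~0x800000000 cycles:
print delta_seconds("2b99f1981b527e", "2b99f9981b527e")  # ~13.7 seconds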
I think it's the cycles of your CPU
What is the unit? Perhaps I'm being thick here, but I can't correlate it with seconds (which the program does print out). Slow runs are around 13 seconds, but are around 34*10^9(dec), 0x800000000 timestamp units (e.g. from 0x2b994c9d31889c to 0x2b9944ab8c4f49).
On Mon, Mar 17, 2014 at 12:09 PM, Maciej Fijalkowski <fijall@gmail.com> wrote:
The number of lines is nonsense. This is a timestamp in hex.
On Mon, Mar 17, 2014 at 12:46 PM, Martin Koch <mak@issuu.com> wrote:
Based On Maciej's suggestion, I tried the following
PYPYLOG=- pypy mem.py 10000000 > out
This generates a logfile which looks something like this
start--> [2b99f1981b527e] {gc-minor [2b99f1981ba680] {gc-minor-walkroots [2b99f1981c2e02] gc-minor-walkroots} [2b99f19890d750] gc-minor} [snip] ... <--stop
It turns out that the culprit is a lot of MINOR collections.
I base this on the following observations:
I can't understand the format of the timestamp on each logline (the "[2b99f1981b527e]"). From what I can see in the code, this should be output from time.clock(), but that doesn't return a number like that when I
run
pypy interactively Instead, I count the number of debug lines between start--> and the corresponding <--stop. Most runs have a few hundred lines of output between start/stop All slow runs have very close to 57800 lines out output between start/stop One such sample does 9609 gc-collect-step operations, 9647 gc-minor operations, and 9647 gc-minor-walkroots operations.
Thanks, /Martin
On Mon, Mar 17, 2014 at 8:21 AM, Maciej Fijalkowski <fijall@gmail.com
wrote:
there is an environment variable PYPYLOG=gc:- (where - is stdout) which will do that for you btw.
maybe you can find out what's that using profiling or valgrind?
On Sun, Mar 16, 2014 at 11:34 PM, Martin Koch <mak@issuu.com> wrote:
I have tried getting the pypy source and building my own version of pypy. I have modified rpython/memory/gc/incminimark.py:major_collection_step() to print out when it starts and when it stops. Apparently, the slow queries do NOT occur during major_collection_step; at least, I have not observed major step output during a query execution. So, apparently, something else is blocking. This could be another aspect of the GC, but it could also be anything else.

Just to be sure, I have tried running the same application in python with garbage collection disabled. I don't see the problem there, so it is somehow related to either the GC or the runtime.
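(For reference, disabling the collector for that CPython comparison is presumably just the standard gc module call; a minimal sketch:)

    import gc

    gc.disable()  # switch off CPython's cyclic garbage collector for the comparison run
    # ... build the graph and run the benchmark as usual ...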
Cheers, /Martin
On Fri, Mar 14, 2014 at 4:19 PM, Martin Koch <mak@issuu.com> wrote:
> We have hacked up a small sample that seems to exhibit the same issue.
>
> We basically generate a linked list of objects. To increase connectedness, elements in the list hold references (dummy_links) to 10 randomly chosen previous elements in the list.
>
> We then time a function that traverses 50000 elements from the list
> from a random start point. If the traversal reaches the end of the list, we instead traverse one of the dummy links. Thus, exactly 50K elements are traversed every time. To generate some garbage, we build a list holding the traversed elements and a dummy list of characters.
> [snip: timing results and the mem.py code, quoted in full earlier in the thread]

are you *sure* it's the walkroots that take that long and not something else (like gc-minor)? More of those mean that you allocate a lot more surviving objects. Can you do two things:

a) take a max of gc-minor (and gc-minor-stackwalk), per request
b) take the sum of those

and plot them

On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch <mak@issuu.com> wrote:
Well, then it works out to around 2.5GHz, which seems reasonable. But it doesn't alter the conclusion from the previous email: The slow queries then all have a duration around 34*10^9 units, 'normal' queries 1*10^9 units, or .4 seconds at this conversion. Also, the log shows that a slow query performs many more gc-minor operations than a 'normal' one: 9600 gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58.
So the question becomes: Why do we get this large spike in gc-minor-walkroots, and, in particular, is there any way to avoid it :) ?
Thanks, /Martin
[snip: earlier quoted messages]

On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski <fijall@gmail.com> wrote:
are you *sure* it's the walkroots that take that long and not something else (like gc-minor)? More of those mean that you allocate a lot more surviving objects. Can you do two things:
a) take a max of gc-minor (and gc-minor-stackwalk), per request
b) take the sum of those
and plot them
^^^ or just paste the results actually
[snip: earlier quoted messages]

Here are the collated results of running each query. For each run, I count how many of each of the pypy debug lines I get. I.e. there were 668 runs that printed 58 loglines that contain "{gc-minor" which was eventually followed by "gc-minor}". I have also counted if the query was slow; interestingly, not all the queries with many gc-minors were slow (but all slow queries had a gc-minor). Please let me know if this is unclear :)

668 gc-minor:58 gc-minor-walkroots:58
10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5
140 gc-minor:59 gc-minor-walkroots:59
1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403
1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249
9 gc-minor:9643 *slow*:1 gc-minor-walkroots:9643 gc-collect-step:9589
1 gc-minor:9644 *slow*:1 gc-minor-walkroots:9644 gc-collect-step:9590
10 gc-minor:9647 *slow*:1 gc-minor-walkroots:9647 gc-collect-step:9609
1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614
1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58
1 jit-log-compiling-loop:1 gc-collect-step:8991 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030 jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 jit-resume:14
1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
1 jit-log-short-preamble:2 jit-log-compiling-loop:2 jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 gc-minor:61 jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:104

Thanks,
/Martin

On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski <fijall@gmail.com> wrote:
[snip: earlier quoted messages]

eh, this is not what I need.

I need a max of TIME it took for a gc-minor and the TOTAL time it took for a gc-minor (per query) (ideally same for gc-walkroots and gc-collect-step)

On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch <mak@issuu.com> wrote:
Here are the collated results of running each query. For each run, I count how many of each of the pypy debug lines i get. I.e. there were 668 runs that printed 58 loglines that contain "{gc-minor" which was eventually followed by "gc-minor}". I have also counted if the query was slow; interestingly, not all the queries with many gc-minors were slow (but all slow queries had a gc-minor).
Please let me know if this is unclear :)
[snip: quoted collation results and earlier messages]

Ah. I had misunderstood. I'll get back to you on that :) thanks /Martin
On 17/03/2014, at 15.21, Maciej Fijalkowski <fijall@gmail.com> wrote:
eh, this is not what I need
I need a max of TIME it took for a gc-minor and the TOTAL time it took for a gc-minor (per query) (ideally same for gc-walkroots and gc-collect-step)
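(A minimal sketch of the kind of per-query aggregation being asked for here, assuming the log was produced with PYPYLOG=- as earlier in the thread, that each query is bracketed by the start-->/<--stop markers printed by the benchmark itself, and that the bracketed values are tick counts. The section names come from the log excerpt above; treat this as a starting point, not a finished tool.)

    import sys
    from collections import defaultdict

    SECTIONS = ('gc-minor', 'gc-minor-walkroots', 'gc-collect-step')

    def report(stats):
        for name in SECTIONS:
            count, total, biggest = stats[name]
            print "%s: count=%d total_ticks=%d max_ticks=%d" % (name, count, total, biggest)
        print

    def parse(path):
        stats = None        # per-query accumulator: section -> [count, total_ticks, max_ticks]
        opened = {}         # section name -> timestamp of the matching '{section' line
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line.startswith('start-->'):
                    stats = defaultdict(lambda: [0, 0, 0])
                elif line.startswith('<--stop'):
                    if stats is not None:
                        report(stats)
                    stats = None
                elif stats is not None and line.startswith('['):
                    # debug lines look like '[2b99f1981b527e] {gc-minor' (open)
                    # and '[2b99f19890d750] gc-minor}' (close)
                    stamp, rest = line.split(']', 1)
                    ticks = int(stamp.lstrip('['), 16)
                    rest = rest.strip()
                    if rest.startswith('{'):
                        opened[rest[1:]] = ticks
                    elif rest.endswith('}'):
                        name = rest[:-1]
                        if name in SECTIONS and name in opened:
                            delta = ticks - opened.pop(name)
                            entry = stats[name]
                            entry[0] += 1
                            entry[1] += delta
                            entry[2] = max(entry[2], delta)

    if __name__ == '__main__':
        parse(sys.argv[1])

Run it over the captured log (e.g. pypy parse_pypylog.py out); converting the tick totals into seconds would rely on the same clock-frequency guess discussed above.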
On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch <mak@issuu.com> wrote: Here are the collated results of running each query. For each run, I count how many of each of the pypy debug lines i get. I.e. there were 668 runs that printed 58 loglines that contain "{gc-minor" which was eventually followed by "gc-minor}". I have also counted if the query was slow; interestingly, not all the queries with many gc-minors were slow (but all slow queries had a gc-minor).
Please let me know if this is unclear :)
668 gc-minor:58 gc-minor-walkroots:58 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 140 gc-minor:59 gc-minor-walkroots:59 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 gc-collect-step:9589 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 gc-collect-step:9590 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 gc-collect-step:9609 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 1 jit-log-compiling-loop:1 gc-collect-step:8991 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030 jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 jit-resume:14 1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84 2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 gc-minor:61 jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:104
Thanks, /Martin
On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski <fijall@gmail.com> wrote:
On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski <fijall@gmail.com> wrote:
are you *sure* it's the walkroots that take that long and not something else (like gc-minor)? More of those mean that you allocate a lot more surviving objects. Can you do two things:
a) take a max of gc-minor (and gc-minor-stackwalk), per request
b) take the sum of those
and plot them
^^^ or just paste the results actually
On Mon, Mar 17, 2014 at 3:18 PM, Martin Koch <mak@issuu.com> wrote:
Well, then it works out to around 2.5 GHz, which seems reasonable. But it doesn't alter the conclusion from the previous email: the slow queries all have a duration of around 34*10^9 units, while 'normal' queries take around 1*10^9 units, or roughly 0.4 seconds at this conversion. Also, the log shows that a slow query performs many more gc-minor operations than a 'normal' one: 9600 gc-collect-step/gc-minor/gc-minor-walkroots operations vs 58.
So the question becomes: Why do we get this large spike in gc-minor-walkroots, and, in particular, is there any way to avoid it :) ?
Thanks, /Martin
On Mon, Mar 17, 2014 at 1:53 PM, Maciej Fijalkowski <fijall@gmail.com> wrote:
I think it's the cycles of your CPU
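If that is right, a timestamp delta converts to seconds by dividing by the clock frequency; a short sketch (editorial, using the ~2.5 GHz figure Martin infers above and timestamps of the kind quoted earlier in the thread):

--------------------------------------------------------------
# Sketch: convert PYPYLOG timestamp deltas to seconds, assuming the
# timestamps are CPU ticks at roughly 2.5 GHz (adjust for the actual CPU).
TICKS_PER_SEC = 2.5e9

def delta_seconds(start_hex, stop_hex):
    return (int(stop_hex, 16) - int(start_hex, 16)) / TICKS_PER_SEC

# the pair bounding one of the slow runs mentioned in the thread:
print delta_seconds("2b9944ab8c4f49", "2b994c9d31889c")   # roughly 13.6
--------------------------------------------------------------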
> [snip: deeper quoting of older messages in the thread - the PYPYLOG timestamp discussion and the mem.py benchmark code]

Here are the total and max times in millions of units; 30000 units is approximately 13 seconds. I have extracted the runs where there are many gc-collect-steps. These are in execution order, so the first runs with many gc-collect-steps aren't slow.

*Totals*: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 *Max*: gc-minor:10 gc-collect-step:247
*Totals*: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 *Max*: gc-minor:10 gc-collect-step:245
*Totals*: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 *Max*: gc-minor:11 gc-collect-step:244
*Totals*: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 *Max*: gc-minor:17 gc-collect-step:244
*Totals*: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 *Max*: gc-minor:11 gc-collect-step:248
*Totals*: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 *Max*: gc-minor:8 gc-collect-step:299
*Totals*: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 *Max*: gc-minor:11 gc-collect-step:246
*Totals*: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 *Max*: gc-minor:8 gc-collect-step:244
*Totals*: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 *Max*: gc-minor:36 gc-collect-step:248
*Totals*: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 *Max*: gc-minor:8 gc-collect-step:244
*Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 *Max*: gc-minor:8 gc-collect-step:245
*Totals*: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 *Max*: gc-minor:8 gc-collect-step:244
*Totals*: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 *Max*: gc-minor:38 gc-collect-step:244
*Totals*: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 *Max*: gc-minor:23 gc-collect-step:245
*Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 *Max*: gc-minor:8 gc-collect-step:246
*Totals*: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 *Max*: gc-minor:9 gc-collect-step:244
*Totals*: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 *Max*: gc-minor:8 gc-collect-step:246
*Totals*: gc-minor:388 slow:1 gc-minor-walkroots:0 gc-collect-step:31179 *Max*: gc-minor:8 gc-collect-step:248
*Totals*: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 *Max*: gc-minor:8 gc-collect-step:250
*Totals*: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 *Max*: gc-minor:8 gc-collect-step:245
*Totals*: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 *Max*: gc-minor:543 gc-collect-step:244
*Totals*: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 *Max*: gc-minor:20 gc-collect-step:246
*Totals*: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 *Max*: gc-minor:25 gc-collect-step:245

Thanks, /Martin

On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch <mak@issuu.com> wrote:
[snip: earlier messages in the thread quoted in full]

Ah - it just occurred to me that the first runs may be slow anyway: since we take the average of the last 100 runs as the benchmark, the first 100 runs are never classified as slow. Indeed, the first three runs with many collections fall within the first 100 runs.

On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch <mak@issuu.com> wrote:
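A small editorial sketch (not from the thread; the helper name is hypothetical) of how the classification in mem.py could also flag slow runs during that warm-up window: compare each run against whatever history has been collected so far instead of waiting for the circular buffer to fill.

--------------------------------------------------------------
# Sketch: flag a run as slow relative to the timings recorded so far,
# so runs within the first num_timings iterations can also be classified.
def is_slow(elapsed, recorded, factor=2.0):
    if not recorded:              # nothing to compare against yet
        return False
    return elapsed > factor * sum(recorded) / len(recorded)
--------------------------------------------------------------

In mem.py this would replace the "if (i > num_timings):" guard, with recorded being the list of elapsed times gathered so far.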
[snip: earlier messages in the thread quoted in full]

Ok, so as you can probably see, the max is not that big, which means the GC really is incremental. What happens is that every now and then you get tons of garbage that survives minor collection. I don't know exactly why, but you should look at which objects can potentially survive for too long.

On Mon, Mar 17, 2014 at 5:37 PM, Martin Koch <mak@issuu.com> wrote:
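One way to act on that suggestion with the mem.py benchmark quoted earlier in the thread (an editorial sketch, not something proposed in the thread; the function name is hypothetical): the obvious long-lived allocation is the acc list in follow(), which keeps every per-step tuple and its 100-character scratch list alive until the query finishes, so that otherwise-young garbage survives any minor collection that happens mid-query. A variant that discards the scratch data immediately would show whether that retained garbage is what feeds the gc-collect-step spikes:

--------------------------------------------------------------
# Sketch: same traversal as follow() in mem.py, but the per-step garbage
# is not kept alive in acc, so it can die young in the nursery.  If the
# gc-collect-step spikes disappear with this variant, the garbage retained
# by the original follow() is what keeps surviving minor collections.
from random import choice

def follow_without_retention(node):
    acc = []
    count = 0
    cur = node
    while count < 50000:
        # generate the same kind of garbage, but deliberately do not keep it
        scratch = [choice("abcdefghijklmnopqrstuvwxyz") for _ in xrange(100)]
        acc.append(cur.v)          # keep only the small integer value
        cur = choice(cur.dummy_links) if cur.next is None else cur.next
        count += 1
    return acc
--------------------------------------------------------------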
Ah - it just occured to me that the first runs may be slow anyway: Since we take the average of the last 100 runs as the benchmark, then the first 100 runs are not classified as slow. Indeed, the first three runs with many collections are in the first 100 runs.
On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch <mak@issuu.com> wrote:
Here are the total and max times in millions of units; 30000 units is approximately 13 seconds. I have extracted the runs where there are many gc-collect-steps. These are in execution order, so the first runs with many gc-collect-steps aren't slow.
Totals: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 Max: gc-minor:10 gc-collect-step:247 Totals: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 Max: gc-minor:10 gc-collect-step:245 Totals: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 Max: gc-minor:11 gc-collect-step:244 Totals: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 Max: gc-minor:17 gc-collect-step:244 Totals: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 Max: gc-minor:11 gc-collect-step:248 Totals: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 Max: gc-minor:8 gc-collect-step:299 Totals: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 Max: gc-minor:11 gc-collect-step:246 Totals: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 Max: gc-minor:8 gc-collect-step:244 Totals: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 Max: gc-minor:36 gc-collect-step:248 Totals: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 Max: gc-minor:8 gc-collect-step:244 Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 Max: gc-minor:8 gc-collect-step:245 Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 Max: gc-minor:8 gc-collect-step:244 Totals: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 Max: gc-minor:38 gc-collect-step:244 Totals: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 Max: gc-minor:23 gc-collect-step:245 Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 Max: gc-minor:8 gc-collect-step:246 Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 Max: gc-minor:9 gc-collect-step:244 Totals: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 Max: gc-minor:8 gc-collect-step:246 Totals: gc-minor:388 slow:1 gc-minor-walkroots:0 gc-collect-step:31179 Max: gc-minor:8 gc-collect-step:248 Totals: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 Max: gc-minor:8 gc-collect-step:250 Totals: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 Max: gc-minor:8 gc-collect-step:245 Totals: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 Max: gc-minor:543 gc-collect-step:244 Totals: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 Max: gc-minor:20 gc-collect-step:246 Totals: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 Max: gc-minor:25 gc-collect-step:245
Thanks, /Martin
On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch <mak@issuu.com> wrote:
Ah. I had misunderstood. I'll get back to you on that :) thanks
/Martin
On 17/03/2014, at 15.21, Maciej Fijalkowski <fijall@gmail.com> wrote:
eh, this is not what I need
I need a max of TIME it took for a gc-minor and the TOTAL time it took for a gc-minor (per query) (ideally same for gc-walkroots and gc-collect-step)
On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch <mak@issuu.com> wrote: Here are the collated results of running each query. For each run, I count how many of each of the pypy debug lines i get. I.e. there were 668 runs that printed 58 loglines that contain "{gc-minor" which was eventually followed by "gc-minor}". I have also counted if the query was slow; interestingly, not all the queries with many gc-minors were slow (but all slow queries had a gc-minor).
Please let me know if this is unclear :)
668 gc-minor:58 gc-minor-walkroots:58 10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5 140 gc-minor:59 gc-minor-walkroots:59 1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403 1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249 9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 gc-collect-step:9589 1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 gc-collect-step:9590 10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 gc-collect-step:9609 1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614 1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58 1 jit-log-compiling-loop:1 gc-collect-step:8991 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030 jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84 1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 jit-resume:14 1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84 2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84 1 jit-log-short-preamble:2 jit-log-compiling-loop:2 jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 gc-minor:61 jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:104
Thanks, /Martin
On Mon, Mar 17, 2014 at 2:23 PM, Maciej Fijalkowski <fijall@gmail.com> wrote:
On Mon, Mar 17, 2014 at 3:20 PM, Maciej Fijalkowski <fijall@gmail.com> wrote: > are you *sure* it's the walkroots that take that long and not > something else (like gc-minor)? More of those mean that you allocate > a > lot more surviving objects. Can you do two things: > > a) take a max of gc-minor (and gc-minor-stackwalk), per request > b) take the sum of those > > and plot them
^^^ or just paste the results actually

Thanks :) /Martin
On 17/03/2014, at 16.41, Maciej Fijalkowski <fijall@gmail.com> wrote:
ok.
so as you can probably see, the max is not that big, which means the GC really is incremental. What happens is that you get tons of garbage that survives minor collection every now and then. I don't exactly know why, but you should look at what objects can potentially survive for too long.
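To make that point concrete with the mem.py example posted elsewhere in the thread (a hypothetical illustration, not a suggested fix): the objects that survive minor collections are exactly the ones the benchmark keeps reachable on purpose. The two functions below assume the Node class from mem.py; the first mirrors follow(), the second allocates the same volume of temporaries but lets them die young.

--------------------------------------------------------------
from random import choice

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def walk_keeping_results(node, steps=50000):
    # Mirrors follow() from mem.py: roughly 50,000 tuples plus their
    # 100-element lists are all still referenced by `acc` whenever a minor
    # collection runs, so they must be copied out of the nursery.
    acc = []
    cur = node
    for _ in xrange(steps):
        acc.append((cur.v, [choice(LETTERS) for _ in xrange(100)]))
        cur = choice(cur.dummy_links) if cur.next is None else cur.next
    return acc

def walk_discarding_results(node, steps=50000):
    # Same allocation volume, but each temporary list becomes unreachable as
    # soon as len() returns, so it can be reclaimed by the next minor
    # collection without surviving it.
    total = 0
    cur = node
    for _ in xrange(steps):
        total += len([choice(LETTERS) for _ in xrange(100)])
        cur = choice(cur.dummy_links) if cur.next is None else cur.next
    return total
--------------------------------------------------------------

In the real server the equivalent question is which per-query data structures stay reachable long enough to be copied out of the nursery.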
On Mon, Mar 17, 2014 at 5:37 PM, Martin Koch <mak@issuu.com> wrote:
Ah - it just occurred to me that the first runs may be slow anyway: since we take the average of the last 100 runs as the benchmark, the first 100 runs are not classified as slow. Indeed, the first three runs with many collections are in the first 100 runs.
On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch <mak@issuu.com> wrote:
Here are the total and max times in millions of units; 30000 units is approximately 13 seconds. I have extracted the runs where there are many gc-collect-steps. These are in execution order, so the first runs with many gc-collect-steps aren't slow.
Totals: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 Max: gc-minor:10 gc-collect-step:247
Totals: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 Max: gc-minor:10 gc-collect-step:245
Totals: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 Max: gc-minor:11 gc-collect-step:244
Totals: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 Max: gc-minor:17 gc-collect-step:244
Totals: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 Max: gc-minor:11 gc-collect-step:248
Totals: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 Max: gc-minor:8 gc-collect-step:299
Totals: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 Max: gc-minor:11 gc-collect-step:246
Totals: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 Max: gc-minor:8 gc-collect-step:244
Totals: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 Max: gc-minor:36 gc-collect-step:248
Totals: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 Max: gc-minor:8 gc-collect-step:244
Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 Max: gc-minor:8 gc-collect-step:245
Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 Max: gc-minor:8 gc-collect-step:244
Totals: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 Max: gc-minor:38 gc-collect-step:244
Totals: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 Max: gc-minor:23 gc-collect-step:245
Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 Max: gc-minor:8 gc-collect-step:246
Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 Max: gc-minor:9 gc-collect-step:244
Totals: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 Max: gc-minor:8 gc-collect-step:246
Totals: gc-minor:388 slow:1 gc-minor-walkroots:0 gc-collect-step:31179 Max: gc-minor:8 gc-collect-step:248
Totals: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 Max: gc-minor:8 gc-collect-step:250
Totals: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 Max: gc-minor:8 gc-collect-step:245
Totals: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 Max: gc-minor:543 gc-collect-step:244
Totals: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 Max: gc-minor:20 gc-collect-step:246
Totals: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 Max: gc-minor:25 gc-collect-step:245
Thanks, /Martin
On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch <mak@issuu.com> wrote:
Ah. I had misunderstood. I'll get back to you on that :) thanks
/Martin
On 17/03/2014, at 15.21, Maciej Fijalkowski <fijall@gmail.com> wrote:
eh, this is not what I need
I need a max of TIME it took for a gc-minor and the TOTAL time it took for a gc-minor (per query) (ideally same for gc-walkroots and gc-collect-step)
On Mon, Mar 17, 2014 at 4:19 PM, Martin Koch <mak@issuu.com> wrote:
Here are the collated results of running each query. For each run, I count how many of each of the pypy debug lines I get, i.e. there were 668 runs that printed 58 loglines containing "{gc-minor" which was eventually followed by "gc-minor}". I have also counted whether the query was slow; interestingly, not all the queries with many gc-minors were slow (but all slow queries had many gc-minors).
Please let me know if this is unclear :)
668 gc-minor:58 gc-minor-walkroots:58
10 gc-minor:58 gc-minor-walkroots:58 gc-collect-step:5
140 gc-minor:59 gc-minor-walkroots:59
1 gc-minor:8441 gc-minor-walkroots:8441 gc-collect-step:8403
1 gc-minor:9300 gc-minor-walkroots:9300 gc-collect-step:9249
9 gc-minor:9643 slow:1 gc-minor-walkroots:9643 gc-collect-step:9589
1 gc-minor:9644 slow:1 gc-minor-walkroots:9644 gc-collect-step:9590
10 gc-minor:9647 slow:1 gc-minor-walkroots:9647 gc-collect-step:9609
1 gc-minor:9663 gc-minor-walkroots:9663 gc-collect-step:9614
1 jit-backend-dump:5 gc-minor:58 gc-minor-walkroots:58
1 jit-log-compiling-loop:1 gc-collect-step:8991 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:9030 jit-tracing:3 gc-minor-walkroots:9030 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
1 jit-log-compiling-loop:1 jit-backend-dump:13 jit-backend:1 jit-log-noopt-loop:2 gc-minor:60 jit-tracing:1 gc-minor-walkroots:60 jit-optimize:2 jit-log-short-preamble:1 jit-backend-addr:1 jit-log-opt-loop:1 jit-mem-looptoken-alloc:1 jit-log-rewritten-loop:1 jit-resume:14
1 jit-log-compiling-loop:1 jit-backend-dump:73 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:60 jit-tracing:3 gc-minor-walkroots:60 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
2 jit-log-compiling-loop:1 jit-backend-dump:78 jit-backend:3 jit-log-noopt-loop:6 jit-log-virtualstate:3 gc-minor:61 jit-tracing:3 gc-minor-walkroots:61 jit-optimize:6 jit-log-short-preamble:2 jit-backend-addr:3 jit-log-opt-loop:1 jit-mem-looptoken-alloc:3 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:1 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:84
1 jit-log-short-preamble:2 jit-log-compiling-loop:2 jit-backend-dump:92 jit-log-noopt-loop:7 jit-log-virtualstate:3 gc-minor:61 jit-tracing:4 gc-minor-walkroots:61 jit-optimize:7 jit-backend:4 jit-backend-addr:4 jit-log-opt-loop:2 jit-mem-looptoken-alloc:4 jit-abort:3 jit-log-rewritten-bridge:2 jit-log-rewritten-loop:2 jit-log-opt-bridge:2 jit-log-compiling-bridge:2 jit-resume:104
Thanks, /Martin

not sure how much more we can help without looking into the code

On Mon, Mar 17, 2014 at 6:05 PM, Martin Koch <mak@issuu.com> wrote:
Thanks :)
/Martin

Well, it would appear that we have the problem because we're generating a lot of garbage in the young generation, just like we're doing in the example we've been studying here. I'm unsure how we can avoid that in our real implementation. Can we force gc of the young generation? Either by gc.collect() or implicitly somehow (does the gc e.g. kick in across function calls?).

Thanks,
/Martin

On Mon, Mar 17, 2014 at 5:39 PM, Maciej Fijalkowski <fijall@gmail.com> wrote:
not sure how much more we can help without looking into the code
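On the question of forcing a collection at a controlled point: the stock gc module's gc.collect() is available under PyPy and forces a full collection (not just the nursery) right away, so at least the pause can be moved off the query path to a moment when this particular server instance is not answering requests. Below is a minimal sketch of that workaround under stated assumptions, not a confirmed fix; get_request(), serve(), mark_unavailable() and mark_available() are hypothetical placeholders, and the request interval is an arbitrary number to tune.

--------------------------------------------------------------
import gc

REQUESTS_BETWEEN_COLLECTS = 10000   # arbitrary; tune to the real mutation rate

def request_loop():
    served = 0
    while True:
        req = get_request()
        serve(req)
        served += 1
        if served % REQUESTS_BETWEEN_COLLECTS == 0:
            mark_unavailable()   # let a duplicate server take the traffic
            gc.collect()         # pay the long pause off the hot path
            mark_available()
--------------------------------------------------------------

Note that this only moves the pause; it does not shrink it, since the cost of a full collection still depends on the size of the whole heap.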
Thanks :)
/Martin
On 17/03/2014, at 16.41, Maciej Fijalkowski <fijall@gmail.com> wrote:
ok.
so as you can probably see, the max is not that big, which means the GC is really incremental. What happens is you get tons of garbage that survives minor collection every now and then. I don't exactly know why, but you should look what objects can potentially survive for too long.
On Mon, Mar 17, 2014 at 5:37 PM, Martin Koch <mak@issuu.com> wrote: Ah - it just occured to me that the first runs may be slow anyway: Since we take the average of the last 100 runs as the benchmark, then the first 100 runs are not classified as slow. Indeed, the first three runs with many collections are in the first 100 runs.
On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch <mak@issuu.com> wrote:
Here are the total and max times in millions of units; 30000 units is approximately 13 seconds. I have extracted the runs where there are many gc-collect-steps. These are in execution order, so the first runs with many gc-collect-steps aren't slow.
Totals: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 Max: gc-minor:10 gc-collect-step:247 Totals: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 Max: gc-minor:10 gc-collect-step:245 Totals: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 Max: gc-minor:11 gc-collect-step:244 Totals: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 Max: gc-minor:17 gc-collect-step:244 Totals: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 Max: gc-minor:11 gc-collect-step:248 Totals: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 Max: gc-minor:8 gc-collect-step:299 Totals: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 Max: gc-minor:11 gc-collect-step:246 Totals: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 Max: gc-minor:8 gc-collect-step:244 Totals: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 Max: gc-minor:36 gc-collect-step:248 Totals: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 Max: gc-minor:8 gc-collect-step:244 Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 Max: gc-minor:8 gc-collect-step:245 Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 Max: gc-minor:8 gc-collect-step:244 Totals: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 Max: gc-minor:38 gc-collect-step:244 Totals: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 Max: gc-minor:23 gc-collect-step:245 Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 Max: gc-minor:8 gc-collect-step:246 Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 Max: gc-minor:9 gc-collect-step:244 Totals: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 Max: gc-minor:8 gc-collect-step:246 Totals: gc-minor:388 slow:1 gc-minor-walkroots:0 gc-collect-step:31179 Max: gc-minor:8 gc-collect-step:248 Totals: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 Max: gc-minor:8 gc-collect-step:250 Totals: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 Max: gc-minor:8 gc-collect-step:245 Totals: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 Max: gc-minor:543 gc-collect-step:244 Totals: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 Max: gc-minor:20 gc-collect-step:246 Totals: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 Max: gc-minor:25 gc-collect-step:245
Thanks, /Martin
On Mon, Mar 17, 2014 at 3:24 PM, Martin Koch <mak@issuu.com> wrote:
Ah. I had misunderstood. I'll get back to you on that :) thanks
/Martin

no, the problem here is definitely that you're generating a lot of garbage that survives the young generation. You can always try playing with the PYPY_GC_NURSERY environment variable (defaults to 4M, I think).
On Mon, Mar 17, 2014 at 9:04 PM, Martin Koch <mak@issuu.com> wrote:
Well, it would appear that we have the problem because we're generating a lot of garbage in the young generation, just like we're doing in the example we've been studying here. I'm unsure how we can avoid that in our real implementation. Can we force gc of the young generation? Either by gc.collect() or implicitly somehow (does the gc e.g. kick in across function calls?).
Thanks, /Martin
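(For concreteness, the pattern being asked about above is roughly the sketch below. This is illustrative only: the handler is a dummy stand-in for the real graph traversal, and, as far as I understand, gc.collect() on PyPy triggers a full major collection rather than only a nursery collection. Carl Friedrich's reply further down explains why collecting more often would not actually help here.)

import gc

def handle_request(query):
    # dummy stand-in for the real graph traversal; allocates some temporaries
    scratch = [str(x) for x in xrange(1000)]
    return query + len(scratch)

def serve(requests):
    results = []
    for q in requests:
        results.append(handle_request(q))
        gc.collect()   # explicit collection between requests (a full one on PyPy)
    return results

if __name__ == "__main__":
    print serve(range(10))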
On Mon, Mar 17, 2014 at 5:39 PM, Maciej Fijalkowski <fijall@gmail.com> wrote:
not sure how much more we can help without looking into the code
On Mon, Mar 17, 2014 at 6:05 PM, Martin Koch <mak@issuu.com> wrote:
Thanks :)
/Martin
On 17/03/2014, at 16.41, Maciej Fijalkowski <fijall@gmail.com> wrote:
ok.
so as you can probably see, the max is not that big, which means the GC really is incremental. What happens is that every now and then you get tons of garbage that survives the minor collection. I don't exactly know why, but you should look at what objects can potentially survive for too long.
On Mon, Mar 17, 2014 at 5:37 PM, Martin Koch <mak@issuu.com> wrote:
Ah - it just occurred to me that the first runs may be slow anyway: since we take the average of the last 100 runs as the benchmark, the first 100 runs are not classified as slow. Indeed, the first three runs with many collections are among the first 100 runs.
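(The reason for this is the warm-up guard in the mem.py benchmark: a run is only compared against the rolling average once the circular buffer of timings has been filled, so the first num_timings runs can never be flagged. A self-contained sketch of that logic, with a random number standing in for the measured run time:)

import random

num_timings = 100
timings = [0.0] * num_timings

for i in xrange(300):
    elapsed = random.random()            # stand-in for a measured run time
    timings[i % num_timings] = elapsed
    if i > num_timings:                  # the first ~100 runs are never flagged
        slow_time = 2 * sum(timings) / num_timings
        if elapsed > slow_time:
            print "slow run %d: elapsed %.3f > threshold %.3f" % (i, elapsed, slow_time)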
On Mon, Mar 17, 2014 at 4:35 PM, Martin Koch <mak@issuu.com> wrote:
Here are the total and max times, in millions of timestamp units; 30000 of these units is approximately 13 seconds. I have extracted the runs where there are many gc-collect-steps. These are in execution order, so the first runs with many gc-collect-steps aren't slow.
Totals: gc-minor:418 gc-minor-walkroots:0 gc-collect-step:28797 Max: gc-minor:10 gc-collect-step:247
Totals: gc-minor:562 gc-minor-walkroots:0 gc-collect-step:30282 Max: gc-minor:10 gc-collect-step:245
Totals: gc-minor:434 gc-minor-walkroots:0 gc-collect-step:31040 Max: gc-minor:11 gc-collect-step:244
Totals: gc-minor:417 slow:1 gc-minor-walkroots:0 gc-collect-step:31270 Max: gc-minor:17 gc-collect-step:244
Totals: gc-minor:435 slow:1 gc-minor-walkroots:0 gc-collect-step:30365 Max: gc-minor:11 gc-collect-step:248
Totals: gc-minor:389 slow:1 gc-minor-walkroots:0 gc-collect-step:31235 Max: gc-minor:8 gc-collect-step:299
Totals: gc-minor:434 slow:1 gc-minor-walkroots:0 gc-collect-step:31124 Max: gc-minor:11 gc-collect-step:246
Totals: gc-minor:386 slow:1 gc-minor-walkroots:0 gc-collect-step:30541 Max: gc-minor:8 gc-collect-step:244
Totals: gc-minor:410 slow:1 gc-minor-walkroots:0 gc-collect-step:31427 Max: gc-minor:36 gc-collect-step:248
Totals: gc-minor:390 slow:1 gc-minor-walkroots:0 gc-collect-step:30743 Max: gc-minor:8 gc-collect-step:244
Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 Max: gc-minor:8 gc-collect-step:245
Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:30837 Max: gc-minor:8 gc-collect-step:244
Totals: gc-minor:412 slow:1 gc-minor-walkroots:0 gc-collect-step:30898 Max: gc-minor:38 gc-collect-step:244
Totals: gc-minor:415 slow:1 gc-minor-walkroots:0 gc-collect-step:30407 Max: gc-minor:23 gc-collect-step:245
Totals: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30591 Max: gc-minor:8 gc-collect-step:246
Totals: gc-minor:387 slow:1 gc-minor-walkroots:0 gc-collect-step:31193 Max: gc-minor:9 gc-collect-step:244
Totals: gc-minor:379 slow:1 gc-minor-walkroots:0 gc-collect-step:30026 Max: gc-minor:8 gc-collect-step:246
Totals: gc-minor:388 slow:1 gc-minor-walkroots:0 gc-collect-step:31179 Max: gc-minor:8 gc-collect-step:248
Totals: gc-minor:378 slow:1 gc-minor-walkroots:0 gc-collect-step:30674 Max: gc-minor:8 gc-collect-step:250
Totals: gc-minor:385 slow:1 gc-minor-walkroots:0 gc-collect-step:30413 Max: gc-minor:8 gc-collect-step:245
Totals: gc-minor:915 slow:1 gc-minor-walkroots:0 gc-collect-step:30830 Max: gc-minor:543 gc-collect-step:244
Totals: gc-minor:405 slow:1 gc-minor-walkroots:0 gc-collect-step:31153 Max: gc-minor:20 gc-collect-step:246
Totals: gc-minor:408 slow:1 gc-minor-walkroots:0 gc-collect-step:29815 Max: gc-minor:25 gc-collect-step:245
Thanks, /Martin
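(For reference, per-query totals and maxima like the ones above can be computed by post-processing the PYPYLOG output. The script below is only a sketch, not the one actually used in this thread; it assumes the log format shown earlier, i.e. lines such as "[2b99f1981b527e] {gc-minor" opening a section and "[2b99f19890d750] gc-minor}" closing it, with hexadecimal CPU-tick timestamps, and it assumes the benchmark prints "start-->" and "<--stop" around each query. The sums and maxima it prints are in raw timestamp units; divide by a million to get the scale used above.)

import re
import sys
from collections import defaultdict

# Matches both "[2b99f1981b527e] {gc-minor" (section start) and
# "[2b99f19890d750] gc-minor}" (section end).
line_re = re.compile(r'\[([0-9a-f]+)\] (\{)?([\w-]+)(\})?')

def report(totals, maxes):
    print "Totals:", " ".join("%s:%d" % kv for kv in sorted(totals.items()))
    print "Max:   ", " ".join("%s:%d" % kv for kv in sorted(maxes.items()))

totals, maxes, open_at = None, None, {}
for line in sys.stdin:
    if line.startswith('start-->'):
        # a new query begins: reset the per-query accumulators
        totals, maxes, open_at = defaultdict(int), defaultdict(int), {}
    elif line.startswith('<--stop'):
        if totals is not None:
            report(totals, maxes)
        totals = None
    elif totals is not None:
        m = line_re.match(line.strip())
        if m:
            ts, opening, name, closing = m.groups()
            ts = int(ts, 16)
            if opening:
                open_at[name] = ts
            elif closing and name in open_at:
                delta = ts - open_at.pop(name)   # duration in raw timestamp units
                totals[name] += delta
                maxes[name] = max(maxes[name], delta)

(Assuming the log was saved to a file as in "PYPYLOG=- pypy mem.py 10000000 > out", this would be run as something like "python parse_log.py < out"; non-matching program output lines are simply skipped.)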

On 17/03/14 20:04, Martin Koch wrote:
Well, it would appear that we have the problem because we're generating a lot of garbage in the young generation, just like we're doing in the example we've been studying here.
No, I think it's because you're generating a lot of garbage in the *old* generation, meaning objects which survive one minor collection but then die.
I'm unsure how we can avoid that in our real implementation. Can we force gc of the young generation? Either by gc.collect() or implicitly somehow (does the gc e.g. kick in across function calls?).
That would make matters worse, because increasing the frequency of minor collects means *more* objects get moved to the old generation (where they cause problems). So indeed, maybe in your case making the new generation bigger might help. This can be done using PYPY_GC_NURSERY, I think (nursery is the space reserved for young objects). The risk is that minor collections become unreasonably slow.

Anyway, if the example code you gave us also shows the problem I think we should eventually look into it. It's not really fair to say "but you're allocating too much!" to explain why the GC takes a lot of time.

Cheers,
Carl Friedrich
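(To make the diagnosis concrete: in the mem.py example posted earlier, each query builds a 50,000-entry result list whose tuples each hold a freshly allocated list of 100 one-character strings, and all of it stays reachable until the query finishes, so it survives the minor collections that happen along the way. The sketch below shows the kind of restructuring that lets that per-query data die young instead. It is illustrative only; it borrows the v/next/dummy_links attribute names from the example and uses a minimal stand-in node class.)

from random import choice

class _Node(object):
    # minimal stand-in for the Node class in the mem.py example
    def __init__(self, v):
        self.v = v
        self.next = None
        self.dummy_links = [self]

def follow_summarized(node, steps=50000):
    # Like follow() in mem.py, but summarizes as it goes instead of
    # accumulating thousands of freshly allocated tuples and lists that
    # stay live (and therefore survive minor collections) until the end.
    total = 0
    cur = node
    for _ in xrange(steps):
        scratch = [choice("abcdefghijklmnopqrstuvwxyz") for _ in xrange(100)]
        total += cur.v + len(scratch)   # 'scratch' can die young
        cur = choice(cur.dummy_links) if cur.next is None else cur.next
    return total

if __name__ == "__main__":
    print follow_summarized(_Node(1), steps=1000)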

Thanks, Carl.
This bit of code certainly exhibits the surprising property that some runs unpredictably stall for a very long time. Further, it seems that this stall time can be made arbitrarily large by increasing the number of nodes generated (== more data in the old generation == more stuff to traverse if lots of garbage is generated and survives the young generation?). As a user of an incremental garbage collector, I would expect that there are pauses due to GC, but that these are predictable and small.
I tried running
PYPY_GC_NURSERY=2000M pypy ./mem.py 10000000
but that seemed to have no effect.
I'm looking forward to the results of the Software Transactional Memory, btw :)
/Martin

Agreed, somehow this should not happen.
Anyway, I'm not the person to look into this, but I filed a bug, so at least your example code does not get lost:
https://bugs.pypy.org/issue1710
Cheers,
Carl Friedrich

Thanks, Carl.
I think the part of the mail thread with the timing measurements, the one showing that it is many gc-collect-steps and not one single major GC, is also relevant for the bug, so that this information won't have to be rediscovered whenever someone gets the time to look at it :) I.e. the mail with lines like this one:
*Totals*: gc-minor:380 slow:1 gc-minor-walkroots:0 gc-collect-step:30207 *Max*: gc-minor:8 gc-collect-step:245
It might also be relevant to include the command line to reproduce the problem:
pypy mem.py 10000000
Thanks,
/Martin

PMJI, but I wonder if some of these objects could come from trivial re-instantiations instead of re-use of mutable objects, e.g., fishing out one attribute to use together with a new value as init values for an (unnecessarily) new object:
obj = ObjClass(obj.someattr, chgval)
where obj.chg = chgval would have done the job without creating garbage.
I suspect this pattern can happen more subtly than above, especially if __new__ is defined to do something tricky with old instances. Also, creating a new object can be a tempting way to feel sure about its complete state, without having to write a custom (re)init method.
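(A small illustration of the two patterns described above; ObjClass and the attribute names are hypothetical, taken only from the snippet in the message:)

class ObjClass(object):
    # hypothetical class, used only to illustrate the two patterns above
    def __init__(self, someattr, chg):
        self.someattr = someattr
        self.chg = chg

def update_by_reinstantiation(obj, chgval):
    # creates a brand-new object every time; the old one becomes garbage
    return ObjClass(obj.someattr, chgval)

def update_in_place(obj, chgval):
    # mutates the existing object; no garbage is created
    obj.chg = chgval
    return obj

if __name__ == "__main__":
    o = ObjClass("payload", 0)
    o = update_by_reinstantiation(o, 1)   # allocates a new object
    o = update_in_place(o, 2)             # does not
    print o.someattr, o.chg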
participants (5)
- Armin Rigo
- Bengt Richter
- Carl Friedrich Bolz
- Maciej Fijalkowski
- Martin Koch