I was just looking through the code that handles frames (as part of my current effort to determine how to improve on CPython's performance), when I noticed the freelist/zombie mechanism for frame allocation handling. While the zombie mechanism seems like a nice optimization, I believe there can be a couple of improvements. Currently, each code object has a zombie frame that last executed it. This zombie is reused when that code object is re-executed in a frame. When a frame is released, it is reassigned as the zombie of the code, and iff the code object already has a zombie assigned to it, it places the frame in the freelist. If I understand correctly, this means, that the "freelist" is actually only ever used for recursive-call frames that were released. It also means that these released frames will be reassigned to other code objects in the future - in which case they will be reallocated, perhaps unnecessarily. "freelist" is just temporary storage for released recursive calls. A program with no recursions will always have an empty freelist. What is bounding the memory consumption of this mechanism is a limit on the number of frames in the freelist (and the fact that there is a limited number of code objects, each of which may have an additional zombie frame). I believe a better way to implement this mechanism: A. Replace the co_zombie frame with a co_zombie_freelist. B. Remove the global freelist. In other words, have a free list for each code object, rather than one-per-code-object and a freelist. This can be memory-bound by limiting the freelist size in each code object. This can be use a bit more memory if a recursion is called just once - and then discarded (waste for example 10 frames instead of 1), but can save a lot of realloc calls when there is more than one recursion used in the same program. It is also possible to substantially increase the number of frames stored per code-object, and then use some kind of more sophisticated aging mechanism on the zombie freelists to periodically get rid of unused freelists. That kind of mechanism would mean that even in the case of recursive calls, virtually all frame allocs are available from the freelist. I also believe this to be somewhat simpler, as there is only one concept (a zombie freelist) rather than 2 (a zombie code object and a freelist for recursive calls), and frames are never realloc'd, but only allocated. Should I make a patch?
Should I make a patch?
-1. This could consume quite a lot of memory, and I doubt that the speed improvement would be significant. Instead, I would check whether performance could be improved by just dropping the freelist. Looking at the code, I see that it performs a realloc (!) of the frame object if the one it found is too small. That surely must be expensive, and should be replaced with a free/malloc pair instead. I'd be curious to see whether malloc on today's systems is still so slow as to justify a free list. If it is, I would propose to create multiple free lists per size classes, e.g. for frames with 10, 20, 30, etc. variables, rather than having free lists per code object. Regards, Martin
The freelist currently serves as a good optimization of a special case
of a recurring recursion. If the same code object (or one of the same
size) is used for recursive calls repeatedly, the freelist will
realloc-to-same-size (which probably has no serious cost) and thus the
cost of allocating/deallocating frames was averted.
I think that in general, the complexity of a sophisticated and
efficient aging mechanism is not worth it just to optimize recursive
calls. The only question is whether it is truly a memory problem, if
using, say, up-to 50 frames per code object?
Note that _only_ recursions will have more than 1 frame attached.
How many recursions are used and then discarded?
How slow is it to constantly malloc/free frames in a recursion?
My proposal will accelerate the following example:
def f(x, y):
if 0 == x: return
f(x-1, y)
def g(x):
if 0 == x: return
g(x-1)
while True:
f(100, 100)
g(100)
The current implementation will work well with the following:
while True:
f(100, 100)
But removing freelist altogether will not work well with any type of recursion.
Eyal
On 6/10/07, "Martin v. Löwis"
Should I make a patch?
-1. This could consume quite a lot of memory, and I doubt that the speed improvement would be significant. Instead, I would check whether performance could be improved by just dropping the freelist. Looking at the code, I see that it performs a realloc (!) of the frame object if the one it found is too small. That surely must be expensive, and should be replaced with a free/malloc pair instead.
I'd be curious to see whether malloc on today's systems is still so slow as to justify a free list. If it is, I would propose to create multiple free lists per size classes, e.g. for frames with 10, 20, 30, etc. variables, rather than having free lists per code object.
Regards, Martin _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/eyal.lotem%40gmail.com
Note that _only_ recursions will have more than 1 frame attached.
That's not true; in the presence of threads, the same method may also be invoked more than one time simultaneously.
But removing freelist altogether will not work well with any type of recursion.
How do you know that? Did you measure the time? On what system? What were the results? Performance optimization without measuring is just unacceptable. Regards, Martin
On 6/10/07, "Martin v. Löwis"
Note that _only_ recursions will have more than 1 frame attached.
That's not true; in the presence of threads, the same method may also be invoked more than one time simultaneously.
Yes, I have missed that, and realized that I missed that myself a bit later. I guess I can rationalize that with the fact that I myself tend to avoid threads.
But removing freelist altogether will not work well with any type of recursion.
How do you know that? Did you measure the time? On what system? What were the results?
Performance optimization without measuring is just unacceptable.
I agree, I may have used the wrong tone above. Removing the freelist will probably either not have a significant effect (at worst, its adding very little work of maintaining it), or improve recursions and functions that tend to be running simultaniously in multiple threads (as in those cases the realloc will not actually resize the frame, and mallocs/free will indeed be saved). But do note my corrected tone (I said "probably" :-) - and anyone is welcome to try removing it and see if they get a performance benefit. The fact threading also causes the same code object to be used in multiple frames makes everything a little less predictable and may mean that having a larger-than-1 number of frames associated with each code object may indeed yield a performance benefit. I am not sure how to benchmark such modifications. Is there any benchmark that includes threaded use of the same functions in typical use cases?
Regards, Martin _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/eyal.lotem%40gmail.com
I am not sure how to benchmark such modifications. Is there any benchmark that includes threaded use of the same functions in typical use cases?
I don't think it's necessary to benchmark that specific case - *any* kind of micro-benchmark would be better than none. If you want to introduce free lists per code object, you need to benchmark such code, and compare it to the status quo. While doing so, I'd ask to also measure the case that the free list is dropped without a replacement. Regards, Martin
participants (2)
-
"Martin v. Löwis"
-
Eyal Lotem