[Twisted-Python] Performance issue in reactor.callLater
Hi everyone!

I'm running a network simulator on top of Twisted, and it exposed a heavy performance bottleneck. When I simulate a larger number of network nodes, each of which uses two or three delayed calls, that adds up, and adding and cancelling calls becomes /really/ expensive. We are talking about roughly one millisecond per add and even a few milliseconds per cancel.

The problem is that the implementation uses a linear list. A heap would be /much/ more appropriate. Cancelling in particular could be sped up by factors by not actually removing the object but just setting a flag that it was cancelled; the cancelled DelayedCall objects would then be filtered out when looking for the next delay. I already implemented that in my simulator when I was trying to eliminate the calls to dc.cancel(). It works perfectly well, but I'd rather see Twisted scale better itself.

Are there any objections to replacing the current implementation? Otherwise I'll just send a patch to the list as soon as I have integrated my version into the respective classes.

Stefan
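A minimal sketch of that flag-based cancellation idea (illustrative names, not the actual Twisted internals): cancel() only flips a flag, and the scheduler filters flagged entries out lazily when it looks for the next call.

    import heapq
    import time

    class FakeDelayedCall:
        """Illustrative stand-in for a DelayedCall."""
        def __init__(self, when, func):
            self.when = when
            self.func = func
            self.cancelled = False

        def cancel(self):
            # O(1): just flag it; the scheduler drops it later.
            self.cancelled = True

        def __lt__(self, other):
            return self.when < other.when

    class LazyScheduler:
        """Toy scheduler: cancelled calls stay in the heap until popped."""
        def __init__(self):
            self._heap = []

        def call_later(self, delay, func):
            call = FakeDelayedCall(time.time() + delay, func)
            heapq.heappush(self._heap, call)   # O(log N) insert
            return call

        def run_due(self, now=None):
            now = time.time() if now is None else now
            while self._heap and self._heap[0].when <= now:
                call = heapq.heappop(self._heap)
                if not call.cancelled:         # filter out lazily
                    call.func()

The trade-off is that cancelled entries linger in the heap until they would have fired, which is exactly the point discussed later in this thread.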
On Mon, 2004-09-06 at 09:43, Stefan Behnel wrote:
Are there any objections to replacing the current implementation?
Not if yours is faster :) I would like to see some benchmarks comparing speed of adding and cancelling for cases with many scheduled calls, but also for cases with few calls, so we can see if your patches slow down other cases. -- Itamar Shtull-Trauring http://itamarst.org
Itamar Shtull-Trauring schrieb:
I would like to see some benchmarks comparing speed of adding and cancelling for cases with many scheduled calls, but also for cases with few calls, so we can see if your patches slow down other cases.
For the first part, I did a few profiler runs that showed results similar to the following. Please note that although I removed a few lines that only belong to my personal code, the top three lines actually /were/ the top three. The "AckManager" lines are a simplified form of my own implementation; think of it as a second-level callLater. The __run_callbacks method is so slow because it repeatedly calls "reactor.callLater".

---------------------------------------
9577183 function calls (9577108 primitive calls) in 117.349 CPU seconds

Ordered by: internal time, call count

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 506487   22.837    0.000   30.519    0.000 bisect.py:3(insort_right)
  50830   12.002    0.000  103.742    0.002 base.py:400(runUntilCurrent)
 506487    8.974    0.000   47.078    0.000 base.py:362(callLater)
 498759    8.865    0.000   66.300    0.000 AckManager.py:81(__run_callbacks)
  50830    8.700    0.000    8.911    0.000 default.py:472(doSelect)
3922829    7.682    0.000    7.682    0.000 base.py:132(__lt__)
 506487    7.585    0.000    7.585    0.000 base.py:54(__init__)
 500901    5.998    0.000    5.998    0.000 AckManager.py:59(__remove_inactive_entries_from_queue)
 500181    5.953    0.000   51.499    0.000 AckManager.py:68(__schedule_run)
      1    2.463    2.463  117.348  117.348 default.py:127(mainLoop)
  50830    2.232    0.000    2.232    0.000 base.py:391(timeout)
---------------------------------------

I think it is easy to see here that bisect becomes a serious bottleneck when the list gets longer. I didn't include any results for cancel() because I eliminated all calls to that method immediately after my first profiler run. Believe me, it was absolutely worth it.

I'll just continue reimplementing the IReactorTime stuff and then re-run the profiler for you. I'll also try to run a few test cases with cancel() - thanks to CVS, I should still have some old code around...

Stefan
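For anyone who wants to reproduce the data-structure cost in isolation, here is a tiny micro-benchmark sketch (illustrative only, separate from the profiling setup above) comparing bisect.insort against heapq.heappush as the number of pending calls grows:

    import bisect
    import heapq
    import random
    import timeit

    def bench(n):
        values = [random.random() for _ in range(n)]

        def list_insert():
            data = []
            for v in values:
                bisect.insort_right(data, v)   # what the list-based scheduler does

        def heap_insert():
            data = []
            for v in values:
                heapq.heappush(data, v)        # what a heap-based scheduler would do

        for name, fn in [("bisect.insort", list_insert), ("heapq.heappush", heap_insert)]:
            t = timeit.timeit(fn, number=10)
            print("%-15s n=%-7d %.3fs" % (name, n, t))

    for n in (100, 10000, 100000):
        bench(n)

This only measures raw insertion cost into plain floats, not DelayedCall objects or cancellation, but it shows how the linear shifting behind insort starts to dominate as N grows.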
Itamar Shtull-Trauring schrieb:
I would like to see some benchmarks comparing speed of adding and cancelling for cases with many scheduled calls, but also for cases with few calls, so we can see if your patches slow down other cases.
So here's a comparison. I had to take a different example to actually see a difference, since the profiling run I sent last time was of about the same complexity.

Original list-based version:
-------------------------------------------------------------
11647647 function calls (11647337 primitive calls) in 153.122 CPU seconds

Ordered by: internal time, call count

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  48733    5.322    0.000    8.469    0.000 bisect.py:3(insort_right)
 595267    3.147    0.000    3.147    0.000 base.py:132(__lt__)
  48733    1.754    0.000   11.545    0.000 base.py:362(callLater)
  48733    1.323    0.000    1.323    0.000 base.py:54(__init__)
     30    1.232    0.041  151.715    5.057 base.py:400(runUntilCurrent)
   3171    0.134    0.000    3.309    0.001 AckManager.py:81(__run_callbacks)
   6801    0.067    0.000    0.426    0.000 AckManager.py:68(__schedule_run)
      1    0.042    0.042    0.042    0.042 base.py:71(cancel)
     30    0.000    0.000    0.000    0.000 base.py:391(timeout)
-------------------------------------------------------------

My heap-based version:
-------------------------------------------------------------
11726288 function calls (11725973 primitive calls) in 159.055 CPU seconds

Ordered by: internal time, call count

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  41752    4.097    0.000    5.420    0.000 heapq.py:226(_siftup)
     24    2.199    0.092  156.848    6.535 base.py:417(runUntilCurrent)
  89610    1.603    0.000    1.779    0.000 heapq.py:174(_siftdown)
 572391    1.212    0.000    1.212    0.000 base.py:134(__le__)
  47858    0.697    0.000    2.778    0.000 base.py:363(callLater)
  47858    0.327    0.000    0.327    0.000 base.py:56(__init__)
  41752    0.288    0.000    5.707    0.000 heapq.py:136(heappop)
  47858    0.262    0.000    1.754    0.000 heapq.py:131(heappush)
   2902    0.150    0.000    4.824    0.002 AckManager.py:81(__run_callbacks)
   6365    0.070    0.000    0.171    0.000 AckManager.py:68(__schedule_run)
     24    0.000    0.000    0.000    0.000 base.py:407(timeout)
     24    0.000    0.000    0.000    0.000 base.py:375(_cleanUpCallLater)
      6    0.000    0.000    0.000    0.000 base.py:72(cancel)
-------------------------------------------------------------

What I found is that callLater becomes /much/ faster in my implementation, while runUntilCurrent /does/ get slower. But that's OK, I think, since there is at least one call to callLater for each iteration in runUntilCurrent; so if there are more calls to callLater, it pays off. I haven't tested the cancel() part very thoroughly yet, but since those method calls should be almost instantaneous now, I expect the speed-up there to be absolutely incredible.

Also keep in mind that callLater, delay and reset can each be called multiple times before a call is actually executed, so I think it really is the right trade-off for runUntilCurrent to do more work than the others. In the first run there happens to be only one call to cancel(), but that one already takes ages compared to the six calls in the second run.

I hope this is somewhat convincing. Could anyone else please do some testing to see if there are other cases I've overlooked?

Stefan
Hi!

I wondered if the constructor parameters of DelayedCall were important elsewhere. Maybe they should stay the same - I don't know. So, for your convenience, here are two complete patches: one that removes the "reset" and "cancel" parameters from the constructor, and one that doesn't. It doesn't actually change much to leave them in; it only moves a part of the work from runUntilCurrent back into the cancel and reset methods. I figure you'd rarely see a difference, so you may as well leave them in.

Have fun,
Stefan
On Sep 6, 2004, at 9:58 AM, Itamar Shtull-Trauring wrote:
On Mon, 2004-09-06 at 09:43, Stefan Behnel wrote:
Are there any objections to replacing the current implementation?
Not if yours is faster :) I would like to see some benchmarks comparing speed of adding and cancelling for cases with many scheduled calls, but also for cases with few calls, so we can see if your patches slow down other cases.
There's also the fact that this:

    def infiniteLoop():
        reactor.callLater(0.0, infiniteLoop)

will just destroy Twisted because the current loop iteration will never finish. In my non-Twisted implementation of the same concept, during each iteration I keep a list of new timers, which gets merged into the heap at the end of each iteration (so that they will be in the heap for the next round).

I'm not sure I understand the logic behind _cleanUpCallLater in the patch? It doesn't seem right, especially for delayed, but I didn't really read very carefully.

-bob
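A minimal sketch of that pending-list idea (illustrative names only, not Bob's actual code): calls scheduled during an iteration go into a side list and are only pushed onto the heap once the tick is over, so a zero-delay call scheduled from inside a callback cannot starve the loop.

    import heapq
    import itertools
    import time

    class PendingScheduler:
        """Toy scheduler: new calls wait in a side list until the next tick,
        so callLater(0, ...) from inside a callback cannot starve the loop."""

        def __init__(self):
            self._heap = []                # (when, seq, func), heap-ordered by time
            self._pending = []             # calls added during the current iteration
            self._seq = itertools.count()  # tie-breaker for equal times

        def call_later(self, delay, func):
            self._pending.append((time.time() + delay, next(self._seq), func))

        def run_until_current(self, now=None):
            now = time.time() if now is None else now
            # Merge everything scheduled since the last tick ...
            for item in self._pending:
                heapq.heappush(self._heap, item)
            del self._pending[:]
            # ... then run only what is due; anything scheduled by the
            # callbacks themselves lands in _pending and waits a round.
            while self._heap and self._heap[0][0] <= now:
                _, _, func = heapq.heappop(self._heap)
                func()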
Bob Ippolito schrieb:
There's also the fact that this:
def infiniteLoop():
    reactor.callLater(0.0, infiniteLoop)
will just destroy Twisted because the current loop iteration will never finish. In my non-Twisted implementation of the same concept, during each iteration I keep a list of new timers, which gets merged into the heap at the end of each iteration (so that they will be in the heap for the next round).
I think that's the way to go. That's actually easy to do; here's a patch against my implementation that does exactly that. It moves a bit more work into runUntilCurrent, but I think I already made my point in that regard.
I'm not sure I understand the logic behind _cleanUpCallLater in the patch? It doesn't seem right, especially for delayed, but I didn't really read very carefully.
In the cancelled case, it just pops the cancelled call - done (and obviously correct). In the delayed case it pops the changed call (which possibly temporarily breaks the heap invariant) and replaces it with itself, so it actually re-inserts the call at the right position.

Though, ... I guess I do see a problem if the value of a call somewhere in the heap is changed. That breaks the heap invariant and may conflict with further changes to the heap. Let me think about it.

Stefan
Stefan Behnel schrieb:
I guess I do see a problem if the value of a call somewhere in the heap is changed. That breaks the heap invariant and may conflict with further changes to the heap. Let me think about it.
Ok, I thought about it. And I believe the best way to do this is to write a specialized heap version. I copied the heapq implementation and adapted it so that it keeps track of the position in the heap for each DelayedCall. That makes it possible to remove a call without linear overhead and in log(N) time.

So, comparing the complexity of the heap version with the original version:

callLater:
  ORIG: O(log(N)) for binary search + O(N) for array insertion (Python lists are arrays, as far as I know)
  HEAP: O(1), as the call is only appended to a list; insertion into the heap is done in runUntilCurrent (as late as possible)

cancel:
  ORIG: O(N) for linear array deletion
  HEAP: O(log(N)) if it was already inserted, O(1) if it was only stored for later insertion (see post by Bob)

reset:
  ORIG: O(N) for linear array deletion + O(log(N)) for binary search + O(N) for array insertion
  HEAP: O(log(N)) for removal + O(log(N)) for insertion if it was already inserted, O(1) if it was only stored

delay:
  see reset - could actually be implemented twice as fast in the heap (only _siftup), but I didn't do that; it's fast enough anyway

runUntilCurrent:
  The heap implementation is about log(N) times slower, which is still much faster than the original implementation if you add it up with callLater. There is always one call to callLater and at most (!) one iteration in runUntilCurrent for each DelayedCall.

So, summing it up, it kicks ass.

Other things to mention: getDelayedCalls has a slight change in semantics. It only returns the delayed calls that were already inserted into the heap, so if you run callLater and getDelayedCalls in the same iteration of the main loop, you will not find your delayed call in the returned tuple. This is due to the change Bob proposed.

Now, are there any test cases in Twisted for the IReactorTime part? I'd like to run them over my implementation to check that it works as expected in all cases. You should also consider running pychecker over the old version; there were some unused variables and that kind of stuff.

Hoping for approval,
Stefan
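A much-simplified sketch of such a position-tracking heap (illustrative, not the actual patch): each entry carries its index in the backing list, so an arbitrary entry can be removed in O(log N) without a linear search.

    class Entry:
        """Illustrative heap entry; stands in for a DelayedCall."""
        def __init__(self, time):
            self.time = time
            self.pos = None        # index inside PositionHeap._heap

    class PositionHeap:
        """Min-heap keyed on Entry.time that tracks each entry's position,
        so remove() works in O(log N) without scanning the list."""

        def __init__(self):
            self._heap = []

        def push(self, entry):
            entry.pos = len(self._heap)
            self._heap.append(entry)
            self._siftdown(0, entry.pos)

        def pop(self):
            last = self._heap.pop()
            if not self._heap:
                last.pos = None
                return last
            first, self._heap[0] = self._heap[0], last
            last.pos, first.pos = 0, None
            self._siftup(0)
            return first

        def remove(self, entry):
            pos = entry.pos
            last = self._heap.pop()
            entry.pos = None
            if pos < len(self._heap):
                self._heap[pos] = last
                last.pos = pos
                self._siftup(pos)            # may need to move down ...
                self._siftdown(0, last.pos)  # ... or up, after replacement

        def _siftdown(self, startpos, pos):
            # Move the entry at pos up toward the root while it is smaller
            # than its parent (same convention as heapq._siftdown).
            heap, item = self._heap, self._heap[pos]
            while pos > startpos:
                parentpos = (pos - 1) >> 1
                parent = heap[parentpos]
                if item.time < parent.time:
                    heap[pos] = parent
                    parent.pos = pos
                    pos = parentpos
                else:
                    break
            heap[pos] = item
            item.pos = pos

        def _siftup(self, pos):
            # Move the entry at pos down to a leaf, then sift it back up
            # (same strategy as heapq._siftup), fixing positions as we go.
            heap, item = self._heap, self._heap[pos]
            endpos, startpos = len(self._heap), pos
            childpos = 2 * pos + 1
            while childpos < endpos:
                rightpos = childpos + 1
                if rightpos < endpos and heap[rightpos].time < heap[childpos].time:
                    childpos = rightpos
                heap[pos] = heap[childpos]
                heap[pos].pos = pos
                pos = childpos
                childpos = 2 * pos + 1
            heap[pos] = item
            item.pos = pos
            self._siftdown(startpos, pos)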
Sorry for spamming - here's a slightly cleaned-up patch that does not include some leftovers - no changes in behaviour. Stefan
On Tue, 2004-09-07 at 05:17, Stefan Behnel wrote:
Sorry for spamming - here's a slightly cleaned-up patch that does not include some leftovers - no changes in behaviour.
Thanks - add it to twistedmatrix.com/bugs/ and assign to me.
Stefan Behnel wrote:
Stefan Behnel schrieb:
I guess I do see a problem if the value of a call somewhere in the heap is changed. That breaks the heap invariant and may conflict with further changes to the heap. Let me think about it.
Ok, I thought about it. And I believe the best way to do this is to write a specialized heap version. I copied the heapq implementation and adapted it so that it keeps track of the position in the heap for each DelayedCall. That makes it possible to remove a call without linear overhead and in log(N) time. [snip]
heapq.py is under the PSF license. PSF is roughly similar to MIT. Is this going to be okay for inclusion in Twisted? Jp
Jp Calderone <exarkun@divmod.com> writes:
Stefan Behnel wrote:
Stefan Behnel schrieb:
I guess I do see a problem if the value of a call somewhere in the heap is changed. That breaks the heap invariant and may conflict with further changes to the heap. Let me think about it.

Ok, I thought about it. And I believe the best way to do this is to write a specialized heap version. I copied the heapq implementation and adapted it so that it keeps track of the position in the heap for each DelayedCall. That makes it possible to remove a call without linear overhead and in log(N) time. [snip]
heapq.py is under the PSF license. PSF is roughly similar to MIT. Is this going to be okay for inclusion in Twisted?
It's also one of the newer, and hence less complicated in this respect, bits of Python. You can probably just get permission from whoever wrote it... # Original code by Kevin O'Connor, augmented by Tim Peters and Raymond Hettinger it seems. Cheers, mwh -- "Also, does the simple algorithm you used in Cyclops have a name?" "Not officially, but it answers to "hey, dumb-ass!" -- Neil Schemenauer and Tim Peters, 23 Feb 2001
On Tue, 2004-09-07 at 13:07, Jp Calderone wrote:
heapq.py is under the PSF license. PSF is roughly similar to MIT. Is this going to be okay for inclusion in Twisted?
It's part of Python, so I don't see why that would be necessary. However, I have profiled various heapq-based schemes for callLater in the past, and they always end up losing pretty big in the average case, and winning only in perverse cases such as hundreds of callLater(0)s per tick. I saw a little discussion of the performance impact, but I am still not entirely convinced that it's a good idea.
On Tue, 2004-09-07 at 14:13, Glyph Lefkowitz wrote:
heapq.py is under the PSF license. PSF is roughly similar to MIT. Is this going to be okay for inclusion in Twisted?
It's part of Python, so I don't see why that would be necessary.
This is a patched / somewhat different version.
However, I have profiled various heapq-based schemes for callLater in the past, and they always end up losing pretty big in the average case, and winning only in perverse cases such as hundreds of callLater(0)s per tick. I saw a little discussion of the performance impact, but I am still not entirely convinced that it's a good idea.
Obviously we wouldn't commit without further performance testing.
On Sep 7, 2004, at 2:28 PM, Itamar Shtull-Trauring wrote:
On Tue, 2004-09-07 at 14:13, Glyph Lefkowitz wrote:
heapq.py is under the PSF license. PSF is roughly similar to MIT. Is this going to be okay for inclusion in Twisted?
It's part of Python, so I don't see why that would be necessary.
This is a patched / somewhat different version.
However, I have profiled various heapq-based schemes for callLater in the past, and they always end up losing pretty big in the average case, and winning only in perverse cases such as hundreds of callLater(0)s per tick. I saw a little discussion of the performance impact, but I am still not entirely convinced that it's a good idea.
Obviously we wouldn't commit without further performance testing.
What is the "average" case anyway? For what application? Where are the performance tests coming from? There is at least one scenario where it makes perfect sense to do a lot of callLater(0.0, ...), when you're doing computation but trying not to block. Though, of course, you would do something like callLater(0.00001, ...) in the current Twisted, because callLater(0.0, ...) pretty much blocks. -bob
Bob Ippolito schrieb:
heapq.py is under the PSF license. PSF is roughly similar to MIT. Is this going to be okay for inclusion in Twisted?
It's part of Python, so I don't see why that would be necessary.
This is a patched / somewhat different version.
Right, it needs to keep track of the array position where a DelayedCall is stored in order to remove it as quickly as possible.
However, I have profiled various heapq-based schemes for callLater in the past, and they always end up losing pretty big in the average case, and winning only in perverse cases such as hundreds of callLater(0)s per tick. I saw a little discussion of the performance impact, but I am still not entirely convinced that it's a good idea.
What is the "average" case anyway? For what application? Where are the performance tests coming from?
Yes, I think that's a problem. Now, as far as I see it, my implementation makes the case where there are a lot of interleaved callLater/cancel/reset calls /much/ faster, and in a case where there are only a few calls it shouldn't matter anyway.

Just to repeat myself: what really gets slower is runUntilCurrent, but the gain in callLater should compensate for that. So, for the case where each callLater leads to a delayed call, the performance gain should be marginal (my patch already avoids shifting the list in linear time), and in all cases where there are more calls to callLater, reset and cancel (added up) than delayed calls actually launched, I expect the heap implementation to be faster by orders of magnitude.
There is at least one scenario where it makes perfect sense to do a lot of callLater(0.0, ...), when you're doing computation but trying not to block.
Though, of course, you would do something like callLater(0.00001, ...) in the current Twisted, because callLater(0.0, ...) pretty much blocks.
What you may gain in that case is that it is no longer necessary to shift the complete list when adding a new call, so there is already an improvement from linear time to O(log(N)). And again: if there are only a few calls to callLater, then this is not the place to look for performance bottlenecks anyway. But that is up to the application programmer.

Stefan

BTW: I went to http://www.twistedmatrix.com/bugs, but I can't find the obvious link saying "file a bug report HERE". So I registered, but I don't seem to get a reply - and I can't login either. Looks like the web server is running with the /old/ version of Twisted... :o)
Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> writes:
BTW: I went to http://www.twistedmatrix.com/bugs, but I can't find the obvious link saying "file a bug report HERE". So I registered, but I don't seem to get a reply - and I can't login either. Looks like the web server is running with the /old/ version of Twisted... :o)
You probably will get a reply, eventually -- at least that's what happened to me. Took about 8 hours, IIRC. Cheers, mwh -- That's why the smartest companies use Common Lisp, but lie about it so all their competitors think Lisp is slow and C++ is fast. (This rumor has, however, gotten a little out of hand. :) -- Erik Naggum, comp.lang.lisp
On Wed, 2004-09-08 at 03:58, Stefan Behnel wrote:
Just to repeat myself: what really gets slower is runUntilCurrent, but the gain in callLater should compensate for that. So, for the case where each callLater leads to a delayed call, the performance gain should be marginal (my patch already avoids shifting the list in linear time), and in all cases where there are more calls to callLater, reset and cancel (added up) than delayed calls actually launched, I expect the heap implementation to be faster by orders of magnitude.
runUntilCurrent happens every tick; callLater only happens sometimes.

Rather than talking about "average" applications: in my understanding, this will speed up applications that have lots of timed events scheduled, at the expense of applications that transfer data over the network.

Anthony, if you're listening, do you have a profiling setup that could give us some authoritative numbers?
Glyph Lefkowitz schrieb:
runUntilCurrent happens every tick; callLater only happens sometimes.
Calling the method "runUntilCurrent", yes. But that's not important. What makes a difference is that there is at most one iteration of the loop in runUntilCurrent per callLater, so there are never fewer calls to callLater than iterations in runUntilCurrent - but in many cases there will be more calls to callLater, which gets even worse if you also count calls to cancel and reset. runUntilCurrent really is the right place to make the trade-off.
Rather than talking about "average" applications: in my understanding, this will speed up applications that have lots of timed events scheduled, at the expense of applications that transfer data over the network.
Just to make myself a bit clearer: runUntilCurrent only gets slower /per delayed call/. So if there are no delayed calls there should be no (measurable) slow down. But if there are delayed calls you will get the speed up. Stefan
On Sep 8, 2004, at 10:37 AM, Stefan Behnel wrote:
Just to make myself a bit clearer: runUntilCurrent only gets slower /per delayed call/. So if there are no delayed calls there should be no (measurable) slow down. But if there are delayed calls you will get the speed up.
Sounds like the right tradeoff to me. However, I'm unconvinced about the reasoning in:
Ok, I thought about it. And I believe the best way to do this is to write a specialized heap version. I copied the heapq implementation and adapted it so that it keeps track of the position in the heap for each DelayedCall. That makes it possible to remove a call without linear overhead and in log(N) time.
To me, it seems like the right thing is what I think you had before:

1) on cancel, mark it canceled
2) on delay, mark it delayed
3) when about to run a DelayedCall, if it is cancelled, ignore it. If it is delayed, reinsert it into the heap in the proper place.

Why did you change to the current method from that?

james
James Y Knight schrieb:
However, I'm unconvinced about the reasoning in:
Ok, I thought about it. And I believe the best way to do this is to write a specialized heap version. I copied the heapq implementation and adapted it so that it keeps track of the position in the heap for each DelayedCall. That makes it possible to remove a call without linear overhead and in log(N) time.
To me, it seems like the right thing is what I think you had before:

1) on cancel, mark it canceled
That works. However, in cases with many pending delayed calls, leaving the cancelled ones on the heap may unnecessarily increase the heap size quite a bit, so I figured it might be better to keep the size low. But I admit that this is a weakly motivated decision.

I personally had the case of an ACK tracker where my first approach was to start a timeout for each ACK and then cancel it. But when I saw the disastrous performance of the original Twisted implementation, I quickly wrote my own queue for it. Anyway, in such a case you'd have a lot of cancelled delayed calls on the heap, and they'd unnecessarily increase the time for new insertions (which is done, as I said, as late as possible, but before the run loop).

I played with the idea of giving cancel() a new boolean parameter "quick_cancel" that would skip the removal step if set to True (see the sketch below). All you'd need to (re-)add in that case would be the check in the runUntilCurrent loop for whether the current call was cancelled. That may be a good compromise.
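A rough sketch of what such a "quick_cancel" flag could look like (purely hypothetical API, not part of any submitted patch; it assumes a heap object with a remove() method, like the position-tracking sketch earlier in the thread):

    class QuickCancelCall:
        """Hypothetical: cancel(quick_cancel=True) skips the O(log N) heap
        removal and relies on the run loop to skip flagged calls instead."""

        def __init__(self, heap):
            self.heap = heap          # object providing remove(call)
            self.cancelled = False

        def cancel(self, quick_cancel=False):
            self.cancelled = True
            if not quick_cancel:
                self.heap.remove(self)   # immediate removal, O(log N)
            # else: stay on the heap; runUntilCurrent checks .cancelled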
2) on delay, mark it delayed
That doesn't work.
3) when about to run a DelayedCall, if it is cancelled, ignore it. If it is delayed, reinsert it into the heap in the proper place.
Why did you change to the current method from that?
Delaying a call breaks the heap invariant and therefore may hinder new insertions.

There is a simple workaround for delay: store the delay time somewhere in the object and only apply it when we find the call as the first item; /then/ move it to the right position. This reduces the number of reinsertions if the call is delayed more than once. Could be worth it.

However, this doesn't work for reset. Reset accepts an arbitrary time value that may be lower than the original. A workaround here could be to correct the heap position in either direction when the time value changes, but that is about as much work as removing and reinserting the call.

I chose the current solution because I figured it would be a good trade-off between code complexity, special cases and performance. The quick_cancel thing is easy to implement (4 lines), though.

Could I have other opinions on this?

Stefan
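A small sketch of that lazy-delay workaround (illustrative only): delay() just records the extra time on the object, and the heap position is only corrected once the call surfaces at the top of the heap.

    import heapq

    class LazyDelayEntry:
        """Illustrative entry: extra_delay is applied only when the entry
        reaches the top of the heap, not when delay() is called."""
        def __init__(self, when):
            self.when = when
            self.extra_delay = 0.0

        def delay(self, seconds):
            self.extra_delay += seconds   # O(1): no heap surgery here

        def __lt__(self, other):
            return self.when < other.when

    def pop_due(heap, now):
        """Pop and return the next entry that is really due, re-inserting
        entries whose pending delay pushes them into the future."""
        while heap:
            entry = heap[0]
            if entry.extra_delay:
                # Apply the stored delay and move it to its real position.
                heapq.heappop(heap)
                entry.when += entry.extra_delay
                entry.extra_delay = 0.0
                heapq.heappush(heap, entry)
                continue
            if entry.when > now:
                return None
            return heapq.heappop(heap)
        return None

As noted above, this only works because delay() can only push a call further into the future; reset() can move it earlier, which this trick cannot handle.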
James Y Knight schrieb:
To me, it seems like the right thing is what I think you had before:

1) on cancel, mark it canceled
Sorry, I do not think that is a good idea anymore. The reason is:

1) If we remove the entry during the call to cancel(), it takes log(N) time and is only done once.

2) If we do /not/ remove the entry, cancel() takes constant time (set a flag), but runUntilCurrent has to deal with a larger heap on each iteration. So this may slow things down considerably if many delayed calls are cancelled and remain on the heap.

Here the right place for the trade-off is in cancel(). The patch I finally submitted as a bug report does 1).

Stefan
On Sep 9, 2004, at 5:39 AM, Stefan Behnel wrote:
1) If we remove the entry during the call to cancel(), it takes log(N) time and is only done once.
2) If we do /not/ remove the entry, cancel() takes constant time (set a flag), but runUntilCurrent has to deal with a larger heap on each iteration. So this may slow things down considerably if many delayed calls are cancelled and remain on the heap.
I'm thinking of something like:

- When you call cancel, set the cancelled flag and increment a counter of cancelled items.
- In runUntilCurrent, if the number of cancelled items in the list is both > 50 and > 1/2 of the total items, filter them out and re-heapify the list.

Now, the only thing left that needs the heap_pos is moving a DelayedCall to a sooner time. This is an unlikely event, and thus we should not be overly concerned about its speed. Now we can use the built-in heapq, which is written in C in Python 2.4.

Patch attached to the bug.

James
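A minimal sketch of that scheme (illustrative names; the actual patch is the one attached to the bug): cancel() stays O(1), and the heap is only compacted once cancelled entries dominate.

    import heapq
    import time

    class Call:
        """Illustrative stand-in for a DelayedCall."""
        def __init__(self, when, func):
            self.when = when
            self.func = func
            self.cancelled = False

        def __lt__(self, other):
            return self.when < other.when

    class CompactingScheduler:
        def __init__(self):
            self._heap = []
            self._cancelled = 0

        def call_later(self, delay, func):
            call = Call(time.time() + delay, func)
            heapq.heappush(self._heap, call)
            return call

        def cancel(self, call):
            call.cancelled = True      # O(1)
            self._cancelled += 1

        def run_until_current(self, now=None):
            now = time.time() if now is None else now
            # Compact only when cancelled entries dominate the heap.
            if self._cancelled > 50 and self._cancelled > len(self._heap) // 2:
                self._heap = [c for c in self._heap if not c.cancelled]
                heapq.heapify(self._heap)
                self._cancelled = 0
            while self._heap and self._heap[0].when <= now:
                call = heapq.heappop(self._heap)
                if call.cancelled:
                    self._cancelled -= 1   # it just left the heap
                    continue
                call.func()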
participants (7)
- Bob Ippolito
- Glyph Lefkowitz
- Itamar Shtull-Trauring
- James Y Knight
- Jp Calderone
- Michael Hudson
- Stefan Behnel