tiny optimization in ceval mainloop
I noticed that one frequently executed line in the mainloop is testing whether either the ticker has dropped to 0 or if there are things_to_do. Would it be kosher to just drop the ticker to 0 whenever things_to_do is set to true? Then you'd only need to check the ticker each time through.

Jeremy

Index: ceval.c
===================================================================
RCS file: /cvsroot/python/python/dist/src/Python/ceval.c,v
retrieving revision 2.332
diff -c -c -r2.332 ceval.c
*** ceval.c	23 Aug 2002 14:11:35 -0000	2.332
--- ceval.c	29 Aug 2002 23:19:18 -0000
***************
*** 378,383 ****
--- 378,384 ----
  int
  Py_AddPendingCall(int (*func)(void *), void *arg)
  {
+ 	PyThreadState *tstate;
  	static int busy = 0;
  	int i, j;
  	/* XXX Begin critical section */
***************
*** 395,400 ****
--- 396,404 ----
  	pendingcalls[i].func = func;
  	pendingcalls[i].arg = arg;
  	pendinglast = j;
+
+ 	tstate = PyThreadState_GET();
+ 	tstate->ticker = 0;
  	things_to_do = 1; /* Signal main loop */
  	busy = 0;
  	/* XXX End critical section */
***************
*** 669,675 ****
  	   async I/O handler); see Py_AddPendingCall() and
  	   Py_MakePendingCalls() above. */

! 	if (things_to_do || --tstate->ticker < 0) {
  		tstate->ticker = tstate->interp->checkinterval;
  		if (things_to_do) {
  			if (Py_MakePendingCalls() < 0) {
--- 673,679 ----
  	   async I/O handler); see Py_AddPendingCall() and
  	   Py_MakePendingCalls() above. */

! 	if (--tstate->ticker < 0) {
  		tstate->ticker = tstate->interp->checkinterval;
  		if (things_to_do) {
  			if (Py_MakePendingCalls() < 0) {
I noticed that one frequently executed line in the mainloop is testing whether either the ticker has dropped to 0 or if there are things_to_do. Would it be kosher to just drop the ticker to 0 whenever things_to_do is set to true? Then you'd only need to check the ticker each time through.
I think not -- Py_AddPendingCall() is supposed to be called from interrupts and other low-level stuff, where you don't know whose thread state you would get. Too bad, it was a nice idea.

--Guido van Rossum (home page: http://www.python.org/~guido/)
Would it be kosher to just drop the ticker to 0 whenever things_to_do is set to true?
I think not -- Py_AddPendingCall() is supposed to be called from interrupts and other low-level stuff, where you don't know whose thread state you would get.
Could you have just one ticker, instead of one per thread?

Greg Ewing, Computer Science Dept,
University of Canterbury,
Christchurch, New Zealand
greg@cosc.canterbury.ac.nz
(A citizen of NewZealandCorp, a wholly-owned subsidiary of USA Inc.)
Greg> Could you have just one ticker, instead of one per thread?

That would make ticker really count down checkinterval ticks. Also, of possible interest is this declaration and comment in longobject.c:

    static int ticker;	/* XXX Could be shared with ceval? */

Any time C code would want to read or update ticker, it would have the GIL, right? Sounds like you could get away with a single ticker. The long int implementation appears to do just fine with only one...

Skip
On Friday, August 30, 2002, at 03:37, Skip Montanaro wrote:
Greg> Could you have just one ticker, instead of one per thread?
That would make ticker really count down checkinterval ticks. Also, of possible interest is this declaration and comment in longobject.c:
static int ticker; /* XXX Could be shared with ceval? */
Any time C code would want to read or update ticker, it would have the GIL, right?
Not if the idea that led to this thread (clearing ticker if something is put in things_to_do) is implemented, because we may be in an interrupt routine at the time we fiddle things_to_do. And I don't think we can be sure that even clearing is guaranteed to work (if another thread is halfway through a load-decrement-store sequence the clear could be lost).

--
- Jack Jansen <Jack.Jansen@oratrix.com> http://www.cwi.nl/~jack -
- If I can't dance I don't want to be part of your revolution -- Emma Goldman -
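To make the race concrete, here is a standalone C sketch (illustrative only, not CPython code; all names are made up) that forces Jack's interleaving deterministically by hand:

    #include <stdio.h>

    /* `--ticker` is really load / decrement / store, so a clear that
       lands between the load and the store gets overwritten. */

    static volatile int ticker = 100;

    int main(void)
    {
        int tmp = ticker;   /* eval loop: load       (tmp == 100) */
        tmp = tmp - 1;      /* eval loop: decrement  (tmp == 99)  */

        ticker = 0;         /* an interrupt fires here and clears the
                               ticker to request an immediate check */

        ticker = tmp;       /* eval loop: store -- the clear is lost
                               and ticker is back to 99 */

        printf("ticker after the race: %d (the 0 was lost)\n", ticker);
        return 0;
    }

The lost clear isn't fatal, though, which is the point made later in the thread: the ticker still counts down to 0 on its own, so the pending call is delayed rather than dropped.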
[Jack]
And I don't think we can be sure that even clearing is guaranteed to work (if another thread is halfway through a load-decrement-store sequence the clear could be lost).
I think that pretty much kills the idea. Has anybody checked whether it causes a measurable speedup? If not, I propose not to bother.

--Guido van Rossum (home page: http://www.python.org/~guido/)
Skip> Any time C code would want to read or update ticker, it would have the GIL, right?

Jack> Not if the idea that led to this thread (clearing ticker if something is put in things_to_do) is implemented, because we may be in an interrupt routine at the time we fiddle things_to_do.

Jack> And I don't think we can be sure that even clearing is guaranteed to work (if another thread is halfway through a load-decrement-store sequence the clear could be lost).

Hmm... I guess you lost me. The code that fiddles the ticker in ceval.c clearly operates while the GIL is held. I think the code in sysmodule.c that updates the checkinterval works under that assumption as well. The other ticker in longobject.c I'm not so sure about.

The patch I submitted doesn't implement the ticker clear that Jeremy originally suggested. It just pulls the ticker and the checkinterval out of the thread state and makes them two globals. They are both manipulated in otherwise the same way.

Skip
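For reference, a minimal sketch of the shape Skip describes, as two globals (the _Py_CheckInterval spelling appears later in this thread; _Py_Ticker is an assumed name for the counter, and things_to_do here is a stand-in for ceval's real flag):

    int _Py_CheckInterval = 10;    /* what sys.setcheckinterval() would set */
    volatile int _Py_Ticker = 10;  /* counts down to the next check */

    static int things_to_do = 0;   /* stand-in for ceval's flag */

    /* The eval-loop test no longer goes through tstate: */
    static void eval_loop_fragment(void)
    {
        if (things_to_do || --_Py_Ticker < 0) {
            _Py_Ticker = _Py_CheckInterval;
            /* ... Py_MakePendingCalls(), maybe offer to yield the GIL ... */
        }
    }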
Skip> Any time C code would want to read or update ticker, it would have the GIL, right?
Jack> Not if the idea that led to this thread (clearing ticker if something is put in things_to_do) is implemented, because we may be in an interrupt routine at the time we fiddle things_to_do.
Jack> And I don't think we can be sure that even clearing is guaranteed to work (if another thread is halfway through a load-decrement-store sequence the clear could be lost).
Hmm... I guess you lost me. The code that fiddles the ticker in ceval.c clearly operates while the GIL is held. I think the code in sysmodule.c that updates the checkinterval works under that assumption as well. The other ticker in longobject.c I'm not so sure about.
The patch I submitted doesn't implement the ticker clear that Jeremy originally suggested. It just pulls the ticker and the checkinterval out of the thread state and makes them two globals. They are both manipulated in otherwise the same way.
Skip
Yeah, but the whole *point* would be to save an extra test and (rarely-taken) jump by allowing Jeremy's suggestion to be implemented. Otherwise I don't see much advantage to the patch (or do you see a speed-up?).

--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido> Yeah, but the whole *point* would be to save an extra test and (rarely-taken) jump by allowing Jeremy's suggestion to be implemented. Otherwise I don't see much advantage to the patch (or do you see a speed-up?).

Haven't tested it for speed. You do save 8 bytes per thread state instance though...

S
Haven't tested it for speed. You do save 8 bytes per thread state instance though...
I don't care about the thread state size. If it could speed Python up by 5% I'd gladly add a kilobyte to each thread state...

--Guido van Rossum (home page: http://www.python.org/~guido/)
The difference I saw, with only the ticker check in ceval, was about 1% for pystone. Python was always faster with the change, but never by much.

Jeremy
[Jack Jansen]
Not if the idea that led to this thread (clearing ticker if something is put in things_to_do) is implemented, because we may be in an interrupt routine at the time we fiddle things_to_do.
And I don't think we can be sure that even clearing is guaranteed to work (if another thread is halfway through a load-decrement-store sequence the clear could be lost).
So long as the ticker is declared volatile, the odds of setting ticker to 0 in Py_AddPendingCall during a "bad time" for --ticker are small, a window of a couple machine instructions. Ticker will eventually go to 0 regardless.

It's not like things_to_do isn't ignored for "long" stretches of time now either: Py_MakePendingCalls returns immediately unless the thread with the GIL just happens to be the main thread. Even if it is the main thread, there's another race there with some non-main thread happening to call Py_AddPendingCall at the same time.

Opening another hole of a couple machine instructions shouldn't make much difference, although Py_MakePendingCalls should also be changed then to reset ticker to 0 in its "early exit because the coincidences I'm relying on haven't happened yet" cases.
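Putting the two adjustments together -- a volatile ticker cleared by Py_AddPendingCall, plus re-clearing it in Py_MakePendingCalls's early exits -- a sketch of the shape this could take (names and conditions are simplified stand-ins, not the real ceval.c logic):

    volatile int _Py_Ticker = 10;          /* volatile: the clear may come
                                              from interrupt context */
    int _Py_CheckInterval = 10;
    static volatile int things_to_do = 0;

    static void add_pending_call_sketch(void)
    {
        /* ... queue the callback ... */
        things_to_do = 1;
        _Py_Ticker = 0;        /* ask the eval loop to check ASAP */
    }

    static int make_pending_calls_sketch(int is_main_thread, int busy)
    {
        if (!is_main_thread || busy) {
            /* Early exit: the coincidences being relied on haven't
               happened yet, so re-arm the ticker so the loop keeps
               offering to run the pending calls. */
            _Py_Ticker = 0;
            return 0;
        }
        /* ... run the queued callbacks ... */
        things_to_do = 0;
        return 0;
    }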
So long as the ticker is declared volatile, the odds of setting ticker to 0 in Py_AddPendingCall during a "bad time" for --ticker are small, a window of a couple machine instructions. Ticker will eventually go to 0 regardless. It's not like things_to_do isn't ignored for "long" stretches of time now either: Py_MakePendingCalls returns immediately unless the thread with the GIL just happens to be the main thread. Even if it is the main thread, there's another race there with some non-main thread happening to call Py_AddPendingCall at the same time.
Good point. (Though some apps set the check interval to 1000; well, that would still be fast enough.)
Opening another hole of a couple machine instructions shouldn't make much difference, although Py_MakePendingCalls should also be changed then to reset ticker to 0 in its "early exit because the coincidences I'm relying on haven't happened yet" cases.
OK, let's try it then.

--Guido van Rossum (home page: http://www.python.org/~guido/)
>> Opening another hole of a couple machine instructions shouldn't make much difference, although Py_MakePendingCalls should also be changed then to reset ticker to 0 in its "early exit because the coincidences I'm relying on haven't happened yet" cases.

Guido> OK, let's try it then.

You mean I just wasted my time running pystones? ;-)

Just the same, here's the output, after and before. For each setting, I ran pystones twice manually, then the three reported times:

with patch:

    Pystone(1.1) time for 50000 passes = 7.52
    This machine benchmarks at 6648.94 pystones/second
    Pystone(1.1) time for 50000 passes = 7.51
    This machine benchmarks at 6657.79 pystones/second
    Pystone(1.1) time for 50000 passes = 7.5
    This machine benchmarks at 6666.67 pystones/second

without patch:

    Pystone(1.1) time for 50000 passes = 7.69
    This machine benchmarks at 6501.95 pystones/second
    Pystone(1.1) time for 50000 passes = 7.68
    This machine benchmarks at 6510.42 pystones/second
    Pystone(1.1) time for 50000 passes = 7.67
    This machine benchmarks at 6518.9 pystones/second

I was quite surprised at the difference. Someone definitely should check this. The patch is at http://python.org/sf/602191

My guess is that the code is avoiding a lot of pointer dereferences.

Oh, wait a minute. I muffed a bit. I initialized the ticker and checkinterval variables to 100. Should have been 10.

... a short time passes while Skip thanks God he's not rebuilding VTK ...

With _Py_CheckInterval set to 10 it's still not too shabby:

    Pystone(1.1) time for 50000 passes = 7.57
    This machine benchmarks at 6605.02 pystones/second
    Pystone(1.1) time for 50000 passes = 7.56
    This machine benchmarks at 6613.76 pystones/second
    Pystone(1.1) time for 50000 passes = 7.55
    This machine benchmarks at 6622.52 pystones/second

This is still without Jeremy's suggested change.

apples-and-oranges-ly, y'rs,
Skip
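For scale, those numbers work out to roughly a 2.3% speedup with the interval at 100 (7.68s vs 7.51s per 50000 passes, 7.68/7.51 ≈ 1.023) and roughly 1.6% with _Py_CheckInterval at 10 (7.68/7.56 ≈ 1.016) -- the same ballpark as the ~1% Jeremy reported earlier in the thread.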
[Skip Montanaro]
... My guess is that the code is avoiding a lot of pointer dereferences. Oh, wait a minute. I muffed a bit. I initialized the ticker and checkinterval variables to 100. Should have been 10.
Someone <wink> may wish to question the historical 10 too. A few weeks ago on c.l.py, a number of programs were posted showing that, on Linux, the thread scheduling is such that the *offer* to switch threads every 10 bytecodes was usually declined: the thread that got the GIL was overwhelmingly most often the thread that released it, so that the whole dance was overwhelmingly most often pure overhead. This may be different under 2.3, where the pthreads GIL is implemented via a semaphore rather than a condvar. But in that case, actually switching threads every 10 bytecodes is an awful lot of thread switching (10 bytecodes don't take as long as they used to <wink>).

I don't know how to pick a good "one size fits all" value, but suspect 10 is "clearly too small". In app after app, people who discover sys.setcheckinterval() discover soon after that performance improves if they increase it.
Someone <wink> may wish to question the historical 10 too. A few weeks ago on c.l.py, a number of programs were posted showing that, on Linux, the thread scheduling is such that the *offer* to switch threads every 10 bytecodes was usually declined: the thread that got the GIL was overwhelmingly most often the thread that released it, so that the whole dance was overwhelmingly most often pure overhead. This may be different under 2.3, where the pthreads GIL is implemented via a semaphore rather than a condvar. But in that case, actually switching threads every 10 bytecodes is an awful lot of thread switching (10 bytecodes don't take as long as they used to <wink>).
I don't know how to pick a good "one size fits all" value, but suspect 10 is "clearly too small". In app after app, people who discover sys.setcheckinterval() discover soon after that performance improves if they increase it.
Let's try 100 and see how that works.

--Guido van Rossum (home page: http://www.python.org/~guido/)
[Tim]
I don't know how to pick a good "one size fits all" value, but suspect 10 is "clearly too small". In app after app, people who discover sys.setcheckinterval() discover soon after that performance improves if they increase it.
[Guido]
Let's try 100 and see how that works.
+1 here. Skip, you want to fold that into your patch?
Guido> Let's try 100 and see how that works.

Tim> +1 here. Skip, you want to fold that into your patch?

Done.

S
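Folded in, the default bump amounts to changing the initializers from the earlier sketch (again, only _Py_CheckInterval is named in this thread; the ticker spelling is assumed):

    int _Py_CheckInterval = 100;    /* was 10 */
    volatile int _Py_Ticker = 100;  /* reloaded from _Py_CheckInterval */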
On vrijdag, augustus 30, 2002, at 06:12, Tim Peters wrote:
[Skip Montanaro]
... My guess is that the code is avoiding a lot of pointer dereferences. Oh, wait a minute. I muffed a bit. I initialized the ticker and checkinterval variables to 100. Should have been 10.
Someone <wink> may wish to question the historical 10 too. A few weeks ago on c.l.py, a number of programs were posted showing that, on Linux, the thread scheduling is such that the *offer* to switch threads every 10 bytecodes was usually declined: the thread that got the GIL was overwhelmingly most often the thread that released it, so that the whole dance was overwhelmingly most often pure overhead.
And it costs!

Running pystone without another thread active I get 5500 pystones out of my machine. Running it with another thread active (in a sleep(1000)) I get 4200. After setcheckinterval(100) I'm back up to 5200.

For completeness' sake: with no other thread active, raising setcheckinterval() doesn't make a difference (it's in the noise; in my measurement it was actually 0.5% slower).

We could get a serious speedup for multithreaded programs if we could raise the check interval. Some wild ideas:

- Use an exponential (or linear?) backoff. If you attempt to switch and nothing happens you double the check interval, up to a maximum. If you do switch you reset to the minimum.

- Combine the above with resetting (to zero? to minimum value if currently >= minimum?) the check interval on anything we know could influence thread schedule (releasing a lock, etc).

--
- Jack Jansen <Jack.Jansen@oratrix.com> http://www.cwi.nl/~jack -
- If I can't dance I don't want to be part of your revolution -- Emma Goldman -
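A standalone sketch of the backoff idea (illustrative names and thresholds, not CPython code):

    enum { MIN_INTERVAL = 10, MAX_INTERVAL = 1000 };

    static int check_interval = MIN_INTERVAL;

    /* Called after each offer to switch threads. */
    static void after_switch_offer(int another_thread_took_the_gil)
    {
        if (another_thread_took_the_gil) {
            /* Someone wants the CPU: go back to offering often. */
            check_interval = MIN_INTERVAL;
        }
        else if (check_interval < MAX_INTERVAL) {
            /* Offer declined: back off exponentially so the lock
               dance costs less when we're effectively alone. */
            check_interval *= 2;
            if (check_interval > MAX_INTERVAL)
                check_interval = MAX_INTERVAL;
        }
    }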
[Jack Jansen]
On vrijdag, augustus 30, 2002, at 06:12 , Tim Peters wrote:
Jack, I'm never on vrijdag -- vrijdag is illegal in the US <wink>.
... And it costs!
Running pystone without another thread active I get 5500 pystones out of my machine. Running it with another thread active (in a sleep(1000)) I get 4200. After setcheckinterval(100) I'm back up to 5200.
For completeness' sake: with no other thread active, raising setcheckinterval() doesn't make a difference (it's in the noise; in my measurement it was actually 0.5% slower).
We could get a serious speedup for multithreaded programs if we could raise the check interval.
Guido already agreed to try boosting it to 100.
Some wild ideas:

- Use an exponential (or linear?) backoff. If you attempt to switch and nothing happens you double the check interval, up to a maximum. If you do switch you reset to the minimum.
On a pthreads system under 2.3, using semaphores, chances are good it will always switch. But unless you're trying to fake soft realtime, it's a real drag on performance to switch so often. We can't out-guess this, because it depends on what the *app* wants. Most apps aren't trying to fake soft realtime, so favoring less frequent switches is a good default.
- Combine the above with resetting (to zero? to minimum value if currently >= minimum?) the check interval on anything we know could influence thread schedule (releasing a lock, etc).
You need a model for what it is you're trying to optimize here. I'm just trying to cut useless overheads <wink>.
On vrijdag, augustus 30, 2002, at 10:35, Tim Peters wrote:
[Jack Jansen]
On vrijdag, augustus 30, 2002, at 06:12 , Tim Peters wrote:
Jack, I'm never on vrijdag -- vrijdag is illegal in the US <wink>.
Oh? Didn't realize that, thought they hadn't gotten any further than outlawing rookdag and drinkdag yet.
Some wild ideas:

- Use an exponential (or linear?) backoff. If you attempt to switch and nothing happens you double the check interval, up to a maximum. If you do switch you reset to the minimum.
On a pthreads system under 2.3, using semaphores, chances are good it will always switch. But unless you're trying to fake soft realtime, it's a real drag on performance to switch so often. We can't out-guess this, because it depends on what the *app* wants. Most apps aren't trying to fake soft realtime, so favoring less frequent switches is a good default.
Agreed. But the main applications I was thinking of are along the lines of one thread doing real computational work and the others doing GUI stuff or serving web requests or some such. I.e. while busy you care about response for other threads, but you don't want to spend too many cycles on it. I remember seeing bad artefacts of having a large value for the check interval, such as bad responsiveness to control-C, but it could well be that this was MacPython-OS9 specific.
You need a model for what it is you're trying [...]
I thought you said you didn't have vrijdag?

--
- Jack Jansen <Jack.Jansen@oratrix.com> http://www.cwi.nl/~jack -
- If I can't dance I don't want to be part of your revolution -- Emma Goldman -
[Jeremy]
I noticed that one frequently executed line in the mainloop is testing whether either the ticker has dropped to 0 or if there are things_to_do. Would it be kosher to just drop the ticker to 0 whenever things_to_do is set to true? Then you'd only need to check the ticker each time through.
[Guido]
I think not -- Py_AddPendingCall() is supposed to be called from interrupts and other low-level stuff, where you don't know whose thread state you would get. Too bad, it was a nice idea.
Well ... does each tstate really need its own ticker? If that were a property of the interpreter instead ("number of ticks until it's time for this interpreter to switch threads"), shared across all threads running in that interpreter, then I bet the visible semantics would be much the same. The GIL is always held when the ticker is decremented or reset, so there's nothing non-deterministic in sharing it.
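A sketch of that per-interpreter variant (the struct and field placement are illustrative, not the real PyInterpreterState layout; checkinterval already lives in the interpreter state per the diff above):

    typedef struct {
        int checkinterval;    /* already per-interpreter today */
        volatile int ticker;  /* moved here from the thread state */
    } interp_sketch;

    /* Eval loop, GIL held, so the decrement itself is race-free: */
    static void eval_loop_fragment(interp_sketch *interp, int things_to_do)
    {
        if (things_to_do || --interp->ticker < 0) {
            interp->ticker = interp->checkinterval;
            /* ... pending calls / offer a thread switch ... */
        }
    }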
participants (7)

- Greg Ewing
- Guido van Rossum
- Jack Jansen
- Jack Jansen
- Jeremy Hylton
- Skip Montanaro
- Tim Peters