[Twisted-Python] Why is deferToThread so slow?

Hi, I have a thread-safe synchronous library which I would like to use in a thread pool via deferToThread. Without deferring to threads I get consistent 1-3 ms response times; with deferToThread I get 30-300 ms, varying wildly. I've tried to construct a test, which is of course not perfect, but it shows some of the difference, and the difference widens with the number of threads (while calling the dumb function directly stays the same). I see similar results when using the (database) library, but because it's an external dependency and its performance gets worse with the number of queries (threads), I thought it would be best to leave it out of the picture, hence the dumb function.

# python /tmp/deft.py 1
deferToThread: avg 316.17 us, sync: avg 1.38 us, 228.71x increase
deferToThread: avg 312.92 us, sync: avg 1.38 us, 226.96x increase
deferToThread: avg 320.22 us, sync: avg 1.39 us, 230.37x increase
deferToThread: avg 317.33 us, sync: avg 1.35 us, 235.24x increase

# python /tmp/deft.py 8
deferToThread: avg 2542.90 us, sync: avg 1.37 us, 1854.14x increase
deferToThread: avg 2544.50 us, sync: avg 1.35 us, 1878.13x increase
deferToThread: avg 2544.47 us, sync: avg 1.36 us, 1864.52x increase
deferToThread: avg 2544.52 us, sync: avg 1.38 us, 1839.01x increase
deferToThread: avg 2544.92 us, sync: avg 1.36 us, 1871.81x increase
deferToThread: avg 2546.71 us, sync: avg 1.39 us, 1830.35x increase
deferToThread: avg 2552.38 us, sync: avg 1.35 us, 1893.17x increase
deferToThread: avg 2552.40 us, sync: avg 1.36 us, 1870.20x increase

# python /tmp/deft.py 16
deferToThread: avg 4745.76 us, sync: avg 1.26 us, 3770.11x increase
deferToThread: avg 4748.67 us, sync: avg 1.24 us, 3817.03x increase
deferToThread: avg 4749.81 us, sync: avg 1.26 us, 3756.39x increase
deferToThread: avg 4749.72 us, sync: avg 1.24 us, 3839.88x increase
deferToThread: avg 4749.87 us, sync: avg 1.28 us, 3709.99x increase
deferToThread: avg 4752.63 us, sync: avg 1.24 us, 3842.90x increase
deferToThread: avg 4752.53 us, sync: avg 1.23 us, 3866.08x increase
deferToThread: avg 4752.55 us, sync: avg 1.23 us, 3855.40x increase
deferToThread: avg 4754.03 us, sync: avg 1.29 us, 3678.09x increase
deferToThread: avg 4754.97 us, sync: avg 1.25 us, 3817.19x increase
deferToThread: avg 4755.45 us, sync: avg 1.32 us, 3593.28x increase
deferToThread: avg 4756.35 us, sync: avg 1.25 us, 3804.18x increase
deferToThread: avg 4756.19 us, sync: avg 1.29 us, 3687.73x increase
deferToThread: avg 4757.19 us, sync: avg 1.23 us, 3860.74x increase
deferToThread: avg 4758.02 us, sync: avg 1.24 us, 3824.33x increase
deferToThread: avg 4759.63 us, sync: avg 1.24 us, 3830.40x increase

On 16 threads it takes nearly 5 ms(!) to call a local function... The test program is here:
https://gist.github.com/bra-fsn/1fd481b44590a939e849cb9073ba1a41

CPython 2.7.11, Twisted 16.2.0.

BTW, I've run this on PyPy 5.1.1 with even worse results:
https://mail.python.org/pipermail/pypy-dev/2016-June/014477.html
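For readers who don't want to follow the gist link, here is a minimal sketch of the kind of comparison the test makes. This is not the gist's actual code; the function name dumb and the iteration count are illustrative.

    # Illustrative sketch (not the linked gist): average the latency of
    # deferToThread(dumb) against a direct dumb() call.
    import time

    from twisted.internet import defer, reactor, threads

    def dumb():
        # A trivial function, so only the dispatch overhead is measured.
        return 1

    @defer.inlineCallbacks
    def dtt(iterations=10000):
        start = time.time()
        for _ in range(iterations):
            yield threads.deferToThread(dumb)   # full thread pool round trip
        deferred_avg = (time.time() - start) / iterations * 1e6

        start = time.time()
        for _ in range(iterations):
            dumb()                              # plain synchronous call
        sync_avg = (time.time() - start) / iterations * 1e6

        print("deferToThread: avg %.2f us, sync: avg %.2f us, %.2fx increase"
              % (deferred_avg, sync_avg, deferred_avg / sync_avg))
        reactor.stop()

    reactor.callWhenRunning(dtt)
    reactor.run()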

Why do you think this is bad performance? With a direct call, you are doing almost nothing. Just pushing a stack frame. With a deferToThread call, you are:

- acquiring the GIL,
- pushing a message into a call queue,
- releasing the GIL,
- waiting for the worker thread to wake up,
- acquiring the GIL,
- pulling the work off the queue,
- invoking the work in the worker thread,
- storing the response on a return queue,
- writing a byte into a pipe to wake up the reactor thread,
- releasing the GIL,
- waiting for the reactor thread to wake up,
- acquiring the GIL,
- reading the byte from the pipe,
- pulling the response work off the queue,
- executing it,
- then invoking a Deferred's callback chain.

Each of these steps involves a couple of function calls, and if each takes 3ms like your simple no-op call, you're looking at 48ms just for starters, not taking into account the fact that when you start tossing things into pipes and mutexes the kernel's scheduler gets involved and may (as you noticed) introduce large amounts of non-determinism as other processes and threads run.

While I would certainly like to see this get faster, and I think it probably could be optimized somewhat, it's not reasonable to expect that a single function call could be competitive with this sort of algorithm, when it's made up of so many function calls of its own. I could definitely be convinced that this is unreasonably slow but it does not seem so from a first reading.

-glyph
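To make the number of moving parts concrete, here is a rough stdlib-only analogue of the round trip described above. It is a sketch under simplifying assumptions (a single worker thread, plain queues, a pipe wake-up), not Twisted's actual implementation.

    # Rough stdlib-only analogue of the deferToThread round trip (a sketch,
    # not Twisted's code): call queue in, result queue out, and a pipe write
    # to wake the "reactor" thread.  Python 2 style, matching the thread.
    import os
    import Queue
    import threading

    work_q = Queue.Queue()      # "reactor" thread -> worker thread
    result_q = Queue.Queue()    # worker thread -> "reactor" thread
    wake_r, wake_w = os.pipe()  # worker writes a byte here to wake the "reactor"

    def worker():
        while True:
            func, args = work_q.get()     # pull the work off the call queue
            result_q.put(func(*args))     # invoke it, store the response
            os.write(wake_w, b"x")        # wake up the "reactor" thread

    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

    def defer_to_worker(func, *args):
        work_q.put((func, args))          # push a message into the call queue
        os.read(wake_r, 1)                # block until the worker wakes us
        return result_q.get()             # pull the response off the return queue

    print(defer_to_worker(lambda x: x + 1, 41))  # prints 42

Every arrow in that picture is a lock acquisition, a queue operation, or a system call, which is why a trivial function call on its own will always look dramatically cheaper.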

On 06/03/16 10:24, Glyph wrote:
[...] Sure, this is not the perfect example; I just wanted to measure the plain latency this solution gives. The whole picture is this: I have an application which runs in uwsgi in multithreaded mode. It uses the (blocking) elasticsearch client. That app can serve queries, with some tens of concurrent requests, in around 3 ms. For several reasons I would like to rewrite this app in Twisted.

If I use the txes2 lib (which is nonblocking), I can achieve around the same performance (although it varies a lot more). This is async; no threads are involved. My problem is that this library lacks several features, so I would like to use the blocking one, which needs to run in threads. When I do the requests in threads (with deferToThread, or just callInThread on the whole handler), the response time is around 10-20 times that of both uwsgi's threaded/blocking setup and Twisted's async one, and it becomes highly unpredictable.

I haven't looked into the details of Twisted's thread pools, but what I would expect here is the same as using a simple Python thread pool (like the one uwsgi uses, or the ones in the standard library), which according to these results are much faster and more predictable than Twisted's. BTW, I use queues in non-Twisted programs and they come nowhere near causing several milliseconds(!) of latency.

OK, here's a more realistic example:
https://gist.github.com/bra-fsn/08734197601e5a63d6a2b56d7b048119

This does what is described above: it calls an ES query in a Twisted thread pool and also calls it directly in the thread the whole loop runs in. With one thread the overhead is somewhat acceptable:

deferToThread: avg 2051.00 us, sync: avg 1554.70 us, 1.32x increase

The direct call responds in 1.5 ms, while deferToThread returns in 2 ms. Things get worse with concurrency. With 16 threads the response time is 18 times that of the direct call (51 ms vs 2.8 ms!):

deferToThread: avg 51515.36 us, sync: avg 2798.19 us, 18.41x increase

With 32 threads:

deferToThread: avg 108222.73 us, sync: avg 2922.28 us, 37.03x increase

I use normal (stdlib) thread pools and I haven't seen this kind of performance degradation. 100 ms is a lot of time...
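The pattern being benchmarked here boils down to something like the following sketch. It is not the linked gist; the host, index name, and query are illustrative placeholders, and it assumes the blocking elasticsearch-py client.

    # Hedged sketch of wrapping a blocking elasticsearch-py search in
    # deferToThread; host, index and query are made up for illustration.
    from elasticsearch import Elasticsearch
    from twisted.internet import defer, threads

    es = Elasticsearch(["localhost:9200"])

    @defer.inlineCallbacks
    def handle_request(query):
        # The blocking client call runs in a worker thread; the reactor thread
        # is free to serve other requests while the query is in flight.
        result = yield threads.deferToThread(es.search, index="myindex", body=query)
        defer.returnValue(result["hits"]["hits"])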

On Fri, Jun 03, 2016 at 10:06:43AM +0200, Nagy, Attila wrote:
There's a lot going on here! I'm going to break my responses down into three categories:

1) the way your test program is written;
2) the way Queue-based thread pools work;
3) the way threads work in Python.

1) The way your test program is written.

There are two puzzling lines in your test program. The first is the use of reactor.callWhenRunning(dtt):

https://gist.github.com/bra-fsn/1fd481b44590a939e849cb9073ba1a41#file-defert...

I'm guessing the goal here was to call dtt over and over again infinitely? Why not replace this with a while True: loop? The reactor won't begin scheduling threads until it's begun running, so you could rely on the yield blocking the coroutine until the reactor's started. I'll use a while True: loop in my version of your test program.

...the second, and more interesting, line:

https://gist.github.com/bra-fsn/1fd481b44590a939e849cb9073ba1a41#file-defert...

Why run this code in a separate thread? Besides the fact that doing so requires you to write thread-safe code (which is hard!), it also means your dtt function will compete with deferToThread for access to the reactor's thread pool.

In general, if you're writing performance-sensitive code, you should not use deferToThread. Instead, you should use deferToThreadPool:

https://twistedmatrix.com/documents/current/api/twisted.internet.threads.htm...

That's because the reactor's thread pool, which deferToThread uses, also handles things like DNS lookups, so other parts of your Twisted program will compete with your performance-sensitive code. To ensure that your performance-sensitive code gets the best chance to run, it should be given its own thread pool which runs *only* that code. I'll use such a thread pool in my version of your test program (a minimal sketch of this setup also appears at the end of this message).

2) The way Queue-based thread pools work.

There's another source of scheduling contention in your program. sys.argv[1] does *not* increase the number of threads -- instead, it increases the *demand* on threads. The number of threads will be no more than 64, per:

https://gist.github.com/bra-fsn/1fd481b44590a939e849cb9073ba1a41#file-defert...

With a value of 16, you'll have 16 * 10000 scheduled function calls that can only run across 64 threads. To make this behavior clear, consider this minimal implementation of a Queue-based thread pool:

https://gist.github.com/markrwilliams/2e40ed0fa06a9e653609dd61fd80ca95#file-...

Note that each testRun needs its own thread, so that waiting for a result doesn't prevent more function calls from being scheduled. This isn't necessary with Twisted's deferToThread(Pool), because that's exactly what Deferreds are for! Also note that this implementation cheats -- the result of a function call f(1) isn't routed back to the place where it was requested. Twisted's thread pool does in fact do this.

With those caveats, let's try this with just 1 * 10000 scheduled function calls across 64 threads:

$ python /tmp/tp.py 1
threadpool: avg 46.067005 us, sync: avg 0.416845 us, 110.51x increase

We see a pretty significant difference between a simple synchronous function call and dispatching to a thread via a Queue and obtaining its result. What happens if we bump things up to 16 * 10000?
$ python simple_threadpool.py 16
threadpool: avg 667.252771 us, sync: avg 1.551497 us, 430.07x increase
threadpool: avg 668.004651 us, sync: avg 1.498873 us, 445.67x increase
threadpool: avg 667.386251 us, sync: avg 1.686789 us, 395.65x increase
threadpool: avg 668.936246 us, sync: avg 1.448275 us, 461.88x increase
threadpool: avg 666.702565 us, sync: avg 1.459888 us, 456.68x increase
threadpool: avg 669.740488 us, sync: avg 1.620622 us, 413.26x increase
threadpool: avg 666.529074 us, sync: avg 1.657008 us, 402.25x increase
threadpool: avg 666.151787 us, sync: avg 1.583210 us, 420.76x increase
threadpool: avg 666.713700 us, sync: avg 1.509841 us, 441.58x increase
threadpool: avg 666.670518 us, sync: avg 1.533161 us, 434.83x increase
threadpool: avg 666.880252 us, sync: avg 1.536666 us, 433.98x increase
threadpool: avg 666.819521 us, sync: avg 1.507123 us, 442.45x increase
threadpool: avg 664.132656 us, sync: avg 1.624818 us, 408.74x increase
threadpool: avg 667.016451 us, sync: avg 1.490790 us, 447.42x increase
threadpool: avg 668.779542 us, sync: avg 1.488286 us, 449.36x increase
threadpool: avg 667.080139 us, sync: avg 1.524386 us, 437.61x increase

Not only did the difference between the two increase, it also took longer to run our synchronous function in a given thread! The difference between the synchronous and thread-pool-based calls is due to the fact that as the amount of work increases, the number of workers to do it doesn't. That means the queue keeps getting longer and longer.

3) The way threads work in Python.

As for our synchronous function, remember that it's running in its own thread; that means it's very likely that it took longer to run because of contention for the GIL.

So, two things to remember about using thread pools in Python:

a) Increasing the amount of work without increasing the number of workers always makes latency worse. This is true of queues and threads in any environment.

b) The GIL *will* become a bottleneck with large amounts of work and an increasing number of threads. That means you can't just keep adding threads to address the workload problem described in a).

Now, let's take what we've learned and apply it to a different version of your test program.
BTW, I've run this on pypy 5.1.1 with even worse results: https://mail.python.org/pipermail/pypy-dev/2016-June/014477.html
...that also has an affordance to not use inlineCallbacks, as these likely interfere with PyPy's JIT:

https://twistedmatrix.com/trac/ticket/6278

Here's the program:

https://gist.github.com/markrwilliams/2e40ed0fa06a9e653609dd61fd80ca95#file-...

Let's try it with 1 * 10000 scheduled functions:

$ python twisted_threadpool.py 1
deferToThread: avg 121.014680 us, sync: avg 0.463771 us, 260.94x increase
deferToThread: avg 130.383369 us, sync: avg 0.475454 us, 274.23x increase
deferToThread: avg 125.587505 us, sync: avg 0.466942 us, 268.96x increase
deferToThread: avg 124.141280 us, sync: avg 0.469970 us, 264.15x increase
deferToThread: avg 135.672952 us, sync: avg 0.505450 us, 268.42x increase
deferToThread: avg 130.711776 us, sync: avg 0.483895 us, 270.12x increase
deferToThread: avg 118.456074 us, sync: avg 0.438543 us, 270.11x increase
deferToThread: avg 137.559747 us, sync: avg 0.504330 us, 272.76x increase
deferToThread: avg 121.825337 us, sync: avg 0.459836 us, 264.93x increase
deferToThread: avg 141.214092 us, sync: avg 0.539142 us, 261.92x increase
...

So Twisted's thread pool is about twice as slow as our toy one. Not too bad! With 16 * 10000 scheduled functions:

deferToThread: avg 1691.969863 us, sync: avg 0.444171 us, 3809.28x increase
deferToThread: avg 1693.141545 us, sync: avg 0.434609 us, 3895.78x increase
deferToThread: avg 1693.619694 us, sync: avg 0.443431 us, 3819.35x increase
deferToThread: avg 1693.614949 us, sync: avg 0.439235 us, 3855.83x increase
deferToThread: avg 1694.422696 us, sync: avg 0.435658 us, 3889.34x increase
deferToThread: avg 1694.554698 us, sync: avg 0.431748 us, 3924.87x increase
deferToThread: avg 1694.615262 us, sync: avg 0.430913 us, 3932.61x increase
deferToThread: avg 1694.747073 us, sync: avg 0.428314 us, 3956.79x increase
deferToThread: avg 1694.701245 us, sync: avg 0.427885 us, 3960.65x increase
deferToThread: avg 1695.052327 us, sync: avg 0.419444 us, 4041.19x increase
deferToThread: avg 1695.149564 us, sync: avg 0.432749 us, 3917.16x increase
deferToThread: avg 1695.340127 us, sync: avg 0.432845 us, 3916.74x increase
deferToThread: avg 1695.489345 us, sync: avg 0.433822 us, 3908.26x increase
deferToThread: avg 1695.886709 us, sync: avg 0.436207 us, 3887.81x increase
deferToThread: avg 1696.334289 us, sync: avg 0.438710 us, 3866.64x increase
deferToThread: avg 1696.634512 us, sync: avg 0.434704 us, 3902.96x increase

Our synchronous function hasn't gotten worse, because there are fewer threads at play (thanks, Twisted!), and the difference between the two has gone up by about 16x (3809.28 / 16 = 238.08 -- and we were seeing ~260x).

For fun, here are the PyPy results for the code that *doesn't* use inlineCallbacks:

$ pypy twisted_threadpool.py 1 deferred
deferToThread: avg 248.463297 us, sync: avg 0.979066 us, 253.78x increase
deferToThread: avg 89.544964 us, sync: avg 0.260448 us, 343.81x increase
deferToThread: avg 63.347292 us, sync: avg 0.204158 us, 310.29x increase
deferToThread: avg 63.825631 us, sync: avg 0.205946 us, 309.91x increase
deferToThread: avg 55.110717 us, sync: avg 0.193667 us, 284.56x increase
deferToThread: avg 51.918244 us, sync: avg 0.212955 us, 243.80x increase
...
deferToThread: avg 62.353158 us, sync: avg 0.239372 us, 260.49x increase
deferToThread: avg 40.433884 us, sync: avg 0.148630 us, 272.04x increase
deferToThread: avg 65.630174 us, sync: avg 0.165820 us, 395.79x increase
deferToThread: avg 85.674763 us, sync: avg 0.259972 us, 329.55x increase
deferToThread: avg 65.085721 us, sync: avg 0.198150 us, 328.47x increase
deferToThread: avg 44.353342 us, sync: avg 0.153947 us, 288.11x increase
...

Looks like PyPy is doing a pretty good job! (I won't show it here, but there's a comment in that gist that contains links to vmprof PyPy runs for both the inlineCallbacks and inlineCallbacks-free versions. Looks like inlineCallbacks *does* get optimized pretty well...)

Things to consider:

1) Use your own thread pool -- see the gist I wrote for how to do that.
2) The behavior you were observing was due to increasing the amount of work but not the workers to do it, and it occurs with any thread pool implementation.
3) PyPy can be really fast!

Hope this helps. Let me know if anything's unclear!

Best,
Mark
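As referenced above, here is a minimal sketch (not the linked gist) of giving performance-sensitive code its own pool and dispatching to it with deferToThreadPool; the pool size and the queryElasticsearch name are illustrative.

    # Minimal sketch of a dedicated thread pool used via deferToThreadPool.
    from twisted.internet import reactor
    from twisted.internet.threads import deferToThreadPool
    from twisted.python.threadpool import ThreadPool

    # A pool that runs *only* our performance-sensitive calls, so it doesn't
    # compete with DNS lookups and other users of the reactor's default pool.
    pool = ThreadPool(minthreads=1, maxthreads=16, name="es-pool")
    pool.start()
    # Tear the pool's worker threads down when the reactor shuts down.
    reactor.addSystemEventTrigger("during", "shutdown", pool.stop)

    def queryElasticsearch(query):
        # Placeholder for the blocking client call (illustrative).
        return {"query": query}

    def handle(query):
        # Returns a Deferred that fires with queryElasticsearch's result.
        return deferToThreadPool(reactor, pool, queryElasticsearch, query)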

On Jun 3, 2016, at 19:10, Gelin Yan <dynamicgl@gmail.com> wrote:
Hi Mark
You mentioned deferToThreadPool. I am curious whether the standard thread pool in multiprocessing.dummy works with Twisted.
deferToThreadPool uses a Twisted threadpool interface, not a stdlib one. -glyph
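For completeness: if you do want results from a stdlib pool (multiprocessing.dummy.Pool is a thread pool with the Pool API) delivered to Twisted code, one hedged sketch is to fire a Deferred from the pool's completion callback via reactor.callFromThread. The helper name defer_to_stdlib_pool is made up for illustration; error handling is omitted.

    # Sketch: bridging a stdlib thread pool result back into Twisted.
    from multiprocessing.dummy import Pool
    from twisted.internet import defer, reactor

    pool = Pool(8)  # 8 worker threads (illustrative size)

    def defer_to_stdlib_pool(func, *args):
        d = defer.Deferred()
        def on_result(result):
            # The pool invokes this callback on one of its own threads, so hop
            # back to the reactor thread before firing the Deferred.
            reactor.callFromThread(d.callback, result)
        pool.apply_async(func, args, callback=on_result)
        return d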

Participants (4):
- Gelin Yan
- Glyph
- Mark Williams
- Nagy, Attila