[Twisted-Python] Why is deferToThread so slow?

Hi, I have a thread-safe synchronous library which I would like to use in a thread pool via deferToThread. Without deferring to threads I get consistent 1-3 ms response times; with deferToThread I get 30-300 ms, varying wildly. I've tried to construct a test, which is of course not perfect, but it shows some of the difference, and the difference widens with the number of threads (while calling the dumb function directly stays the same). I see similar results when using the (database) library, but because it's an external dependency and its performance gets worse with the number of queries (threads), I thought it would be best to leave it out of the picture, hence the dumb function.

# python /tmp/deft.py 1
deferToThread: avg 316.17 us, sync: avg 1.38 us, 228.71x increase
deferToThread: avg 312.92 us, sync: avg 1.38 us, 226.96x increase
deferToThread: avg 320.22 us, sync: avg 1.39 us, 230.37x increase
deferToThread: avg 317.33 us, sync: avg 1.35 us, 235.24x increase

# python /tmp/deft.py 8
deferToThread: avg 2542.90 us, sync: avg 1.37 us, 1854.14x increase
deferToThread: avg 2544.50 us, sync: avg 1.35 us, 1878.13x increase
deferToThread: avg 2544.47 us, sync: avg 1.36 us, 1864.52x increase
deferToThread: avg 2544.52 us, sync: avg 1.38 us, 1839.01x increase
deferToThread: avg 2544.92 us, sync: avg 1.36 us, 1871.81x increase
deferToThread: avg 2546.71 us, sync: avg 1.39 us, 1830.35x increase
deferToThread: avg 2552.38 us, sync: avg 1.35 us, 1893.17x increase
deferToThread: avg 2552.40 us, sync: avg 1.36 us, 1870.20x increase

# python /tmp/deft.py 16
deferToThread: avg 4745.76 us, sync: avg 1.26 us, 3770.11x increase
deferToThread: avg 4748.67 us, sync: avg 1.24 us, 3817.03x increase
deferToThread: avg 4749.81 us, sync: avg 1.26 us, 3756.39x increase
deferToThread: avg 4749.72 us, sync: avg 1.24 us, 3839.88x increase
deferToThread: avg 4749.87 us, sync: avg 1.28 us, 3709.99x increase
deferToThread: avg 4752.63 us, sync: avg 1.24 us, 3842.90x increase
deferToThread: avg 4752.53 us, sync: avg 1.23 us, 3866.08x increase
deferToThread: avg 4752.55 us, sync: avg 1.23 us, 3855.40x increase
deferToThread: avg 4754.03 us, sync: avg 1.29 us, 3678.09x increase
deferToThread: avg 4754.97 us, sync: avg 1.25 us, 3817.19x increase
deferToThread: avg 4755.45 us, sync: avg 1.32 us, 3593.28x increase
deferToThread: avg 4756.35 us, sync: avg 1.25 us, 3804.18x increase
deferToThread: avg 4756.19 us, sync: avg 1.29 us, 3687.73x increase
deferToThread: avg 4757.19 us, sync: avg 1.23 us, 3860.74x increase
deferToThread: avg 4758.02 us, sync: avg 1.24 us, 3824.33x increase
deferToThread: avg 4759.63 us, sync: avg 1.24 us, 3830.40x increase

On 16 threads it takes nearly 5 ms(!) to call a local function... The test program is here:
https://gist.github.com/bra-fsn/1fd481b44590a939e849cb9073ba1a41

CPython 2.7.11, Twisted 16.2.0.

BTW, I've run this on PyPy 5.1.1 with even worse results:
https://mail.python.org/pipermail/pypy-dev/2016-June/014477.html
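For readers who don't want to follow the gist link, here is a minimal sketch of the kind of comparison the test makes. This is not the gist's actual code; the function name dumb and the iteration count are illustrative.

    # Illustrative sketch (not the linked gist): average the latency of
    # deferToThread(dumb) against a direct dumb() call.
    import time

    from twisted.internet import defer, reactor, threads

    def dumb():
        # A trivial function, so only the dispatch overhead is measured.
        return 1

    @defer.inlineCallbacks
    def dtt(iterations=10000):
        start = time.time()
        for _ in range(iterations):
            yield threads.deferToThread(dumb)   # full thread pool round trip
        deferred_avg = (time.time() - start) / iterations * 1e6

        start = time.time()
        for _ in range(iterations):
            dumb()                              # plain synchronous call
        sync_avg = (time.time() - start) / iterations * 1e6

        print("deferToThread: avg %.2f us, sync: avg %.2f us, %.2fx increase"
              % (deferred_avg, sync_avg, deferred_avg / sync_avg))
        reactor.stop()

    reactor.callWhenRunning(dtt)
    reactor.run()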

Why do you think this is bad performance? With a direct call, you are doing almost nothing. Just pushing a stack frame. With a deferToThread call, you are:

- acquiring the GIL,
- pushing a message into a call queue,
- releasing the GIL,
- waiting for the worker thread to wake up,
- acquiring the GIL,
- pulling the work off the queue,
- invoking the work in the worker thread,
- storing the response on a return queue,
- writing a byte into a pipe to wake up the reactor thread,
- releasing the GIL,
- waiting for the reactor thread to wake up,
- acquiring the GIL,
- reading the byte from the pipe,
- pulling the response work off the queue,
- executing it,
- then invoking a Deferred's callback chain.

Each of these steps involves a couple of function calls, and if each takes 3ms like your simple no-op call, you're looking at 48ms just for starters, not taking into account the fact that when you start tossing things into pipes and mutexes the kernel's scheduler gets involved and may (as you noticed) introduce large amounts of non-determinism as other processes and threads run.

While I would certainly like to see this get faster, and I think it probably could be optimized somewhat, it's not reasonable to expect that a single function call could be competitive with this sort of algorithm, when it's made up of so many function calls of its own. I could definitely be convinced that this is unreasonably slow but it does not seem so from a first reading.

-glyph
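To make the number of moving parts concrete, here is a rough stdlib-only analogue of the round trip described above. It is a sketch under simplifying assumptions (a single worker thread, plain queues, a pipe wake-up), not Twisted's actual implementation.

    # Rough stdlib-only analogue of the deferToThread round trip (a sketch,
    # not Twisted's code): call queue in, result queue out, and a pipe write
    # to wake the "reactor" thread.  Python 2 style, matching the thread.
    import os
    import Queue
    import threading

    work_q = Queue.Queue()      # "reactor" thread -> worker thread
    result_q = Queue.Queue()    # worker thread -> "reactor" thread
    wake_r, wake_w = os.pipe()  # worker writes a byte here to wake the "reactor"

    def worker():
        while True:
            func, args = work_q.get()     # pull the work off the call queue
            result_q.put(func(*args))     # invoke it, store the response
            os.write(wake_w, b"x")        # wake up the "reactor" thread

    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

    def defer_to_worker(func, *args):
        work_q.put((func, args))          # push a message into the call queue
        os.read(wake_r, 1)                # block until the worker wakes us
        return result_q.get()             # pull the response off the return queue

    print(defer_to_worker(lambda x: x + 1, 41))  # prints 42

Every arrow in that picture is a lock acquisition, a queue operation, or a system call, which is why a trivial function call on its own will always look dramatically cheaper.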

On 06/03/16 10:24, Glyph wrote:
[...] Sure, this is not the perfect example; I just wanted to measure the plain latency this solution gives. The whole picture is this: I have an application which runs in uwsgi in multithreaded mode. It uses the (blocking) elasticsearch client. That app can serve queries, with some tens of concurrent requests, in around 3 ms. For several reasons I would like to rewrite this app in Twisted.

If I use the txes2 lib (which is nonblocking), I can achieve around the same performance (although it varies a lot more). This is async; no threads are involved. My problem is that this library lacks several features, so I would like to use the blocking one, which needs to run in threads. When I do the requests in threads (with deferToThread, or just callInThread on the whole handler), the response time is around 10-20 times that of both uwsgi's threaded/blocking setup and Twisted's async one, and it becomes highly unpredictable.

I haven't looked into the details of Twisted's thread pools, but what I would expect here is the same as using a simple Python thread pool (like the one uwsgi uses, or the ones in the standard library), which according to these results are much faster and more predictable than Twisted's. BTW, I use queues in non-Twisted programs and they come nowhere near causing several milliseconds(!) of latency.

OK, here's a more realistic example:
https://gist.github.com/bra-fsn/08734197601e5a63d6a2b56d7b048119

This does what is described above: it calls an ES query in a Twisted thread pool and also calls it directly in the thread the whole loop runs in. With one thread the overhead is somewhat acceptable:

deferToThread: avg 2051.00 us, sync: avg 1554.70 us, 1.32x increase

The direct call responds in 1.5 ms, while deferToThread returns in 2 ms. Things get worse with concurrency. With 16 threads the response time is 18 times that of the direct call (51 ms vs 2.8 ms!):

deferToThread: avg 51515.36 us, sync: avg 2798.19 us, 18.41x increase

With 32 threads:

deferToThread: avg 108222.73 us, sync: avg 2922.28 us, 37.03x increase

I use normal (stdlib) thread pools and I haven't seen this kind of performance degradation. 100 ms is a lot of time...
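The pattern being benchmarked here boils down to something like the following sketch. It is not the linked gist; the host, index name, and query are illustrative placeholders, and it assumes the blocking elasticsearch-py client.

    # Hedged sketch of wrapping a blocking elasticsearch-py search in
    # deferToThread; host, index and query are made up for illustration.
    from elasticsearch import Elasticsearch
    from twisted.internet import defer, threads

    es = Elasticsearch(["localhost:9200"])

    @defer.inlineCallbacks
    def handle_request(query):
        # The blocking client call runs in a worker thread; the reactor thread
        # is free to serve other requests while the query is in flight.
        result = yield threads.deferToThread(es.search, index="myindex", body=query)
        defer.returnValue(result["hits"]["hits"])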

On Fri, Jun 03, 2016 at 10:06:43AM +0200, Nagy, Attila wrote:
There's a lot going on here! I'm going to break my responses down into three categories:

1) the way your test program is written;
2) the way Queue-based thread pools work;
3) the way threads work in Python.

1) The way your test program is written.

There are two puzzling lines in your test program. The first is the use of reactor.callWhenRunning(dtt):

https://gist.github.com/bra-fsn/1fd481b44590a939e849cb9073ba1a41#file-defert...

I'm guessing the goal here was to call dtt over and over again infinitely? Why not replace this with a while True: loop? The reactor won't begin scheduling threads until it's begun running, so you could rely on the yield blocking the coroutine until the reactor's started. I'll use a while True: loop in my version of your test program.

...the second, and more interesting, line:

https://gist.github.com/bra-fsn/1fd481b44590a939e849cb9073ba1a41#file-defert...

Why run this code in a separate thread? Besides the fact that doing so requires you to write thread-safe code (which is hard!), it also means your dtt function will compete with deferToThread for access to the reactor's thread pool.

In general, if you're writing performance-sensitive code, you should not use deferToThread. Instead, you should use deferToThreadPool:

https://twistedmatrix.com/documents/current/api/twisted.internet.threads.htm...

That's because the reactor's thread pool, which deferToThread uses, also handles things like DNS lookups, so other parts of your Twisted program will compete with your performance-sensitive code. To ensure that your performance-sensitive code gets the best chance to run, it should be given its own thread pool which runs *only* that code. I'll use such a thread pool in my version of your test program (a minimal sketch of this setup also appears at the end of this message).

2) The way Queue-based thread pools work.

There's another source of scheduling contention in your program. sys.argv[1] does *not* increase the number of threads -- instead, it increases the *demand* on threads. The number of threads will be no more than 64, per:

https://gist.github.com/bra-fsn/1fd481b44590a939e849cb9073ba1a41#file-defert...

With a value of 16, you'll have 16 * 10000 scheduled function calls that can only run across 64 threads. To make this behavior clear, consider this minimal implementation of a Queue-based thread pool:

https://gist.github.com/markrwilliams/2e40ed0fa06a9e653609dd61fd80ca95#file-...

Note that each testRun needs its own thread, so that waiting for a result doesn't prevent more function calls from being scheduled. This isn't necessary with Twisted's deferToThread(Pool), because that's exactly what Deferreds are for! Also note that this implementation cheats -- the result of a function call f(1) isn't routed back to the place where it was requested. Twisted's thread pool does in fact do this.

With those caveats, let's try this with just 1 * 10000 scheduled function calls across 64 threads:

$ python /tmp/tp.py 1
threadpool: avg 46.067005 us, sync: avg 0.416845 us, 110.51x increase

We see a pretty significant difference between a simple synchronous function call and dispatching to a thread via a Queue and obtaining its result. What happens if we bump things up to 16 * 10000?
$ python simple_threadpool.py 16
threadpool: avg 667.252771 us, sync: avg 1.551497 us, 430.07x increase
threadpool: avg 668.004651 us, sync: avg 1.498873 us, 445.67x increase
threadpool: avg 667.386251 us, sync: avg 1.686789 us, 395.65x increase
threadpool: avg 668.936246 us, sync: avg 1.448275 us, 461.88x increase
threadpool: avg 666.702565 us, sync: avg 1.459888 us, 456.68x increase
threadpool: avg 669.740488 us, sync: avg 1.620622 us, 413.26x increase
threadpool: avg 666.529074 us, sync: avg 1.657008 us, 402.25x increase
threadpool: avg 666.151787 us, sync: avg 1.583210 us, 420.76x increase
threadpool: avg 666.713700 us, sync: avg 1.509841 us, 441.58x increase
threadpool: avg 666.670518 us, sync: avg 1.533161 us, 434.83x increase
threadpool: avg 666.880252 us, sync: avg 1.536666 us, 433.98x increase
threadpool: avg 666.819521 us, sync: avg 1.507123 us, 442.45x increase
threadpool: avg 664.132656 us, sync: avg 1.624818 us, 408.74x increase
threadpool: avg 667.016451 us, sync: avg 1.490790 us, 447.42x increase
threadpool: avg 668.779542 us, sync: avg 1.488286 us, 449.36x increase
threadpool: avg 667.080139 us, sync: avg 1.524386 us, 437.61x increase

Not only did the difference between the two increase, it also took longer to run our synchronous function in a given thread! The difference between the synchronous and thread-pool-based calls is due to the fact that as the amount of work increases, the number of workers to do it doesn't. That means the queue keeps getting longer and longer.

3) The way threads work in Python.

As for our synchronous function, remember that it's running in its own thread; that means it's very likely that it took longer to run because of contention for the GIL.

So, two things to remember about using thread pools in Python:

a) Increasing the amount of work without increasing the number of workers always makes latency worse. This is true of queues and threads in any environment.

b) The GIL *will* become a bottleneck with large amounts of work and an increasing number of threads. That means you can't just keep adding threads to address the workload problem described in a).

Now, let's take what we've learned and apply it to a different version of your test program.
BTW, I've run this on pypy 5.1.1 with even worse results: https://mail.python.org/pipermail/pypy-dev/2016-June/014477.html
...that also has an affordance to not use inlineCallbacks, as these likely interfere with PyPy's JIT:

https://twistedmatrix.com/trac/ticket/6278

Here's the program:

https://gist.github.com/markrwilliams/2e40ed0fa06a9e653609dd61fd80ca95#file-...

Let's try it with 1 * 10000 scheduled functions:

$ python twisted_threadpool.py 1
deferToThread: avg 121.014680 us, sync: avg 0.463771 us, 260.94x increase
deferToThread: avg 130.383369 us, sync: avg 0.475454 us, 274.23x increase
deferToThread: avg 125.587505 us, sync: avg 0.466942 us, 268.96x increase
deferToThread: avg 124.141280 us, sync: avg 0.469970 us, 264.15x increase
deferToThread: avg 135.672952 us, sync: avg 0.505450 us, 268.42x increase
deferToThread: avg 130.711776 us, sync: avg 0.483895 us, 270.12x increase
deferToThread: avg 118.456074 us, sync: avg 0.438543 us, 270.11x increase
deferToThread: avg 137.559747 us, sync: avg 0.504330 us, 272.76x increase
deferToThread: avg 121.825337 us, sync: avg 0.459836 us, 264.93x increase
deferToThread: avg 141.214092 us, sync: avg 0.539142 us, 261.92x increase
...

So Twisted's thread pool is about twice as slow as our toy one. Not too bad! With 16 * 10000 scheduled functions:

deferToThread: avg 1691.969863 us, sync: avg 0.444171 us, 3809.28x increase
deferToThread: avg 1693.141545 us, sync: avg 0.434609 us, 3895.78x increase
deferToThread: avg 1693.619694 us, sync: avg 0.443431 us, 3819.35x increase
deferToThread: avg 1693.614949 us, sync: avg 0.439235 us, 3855.83x increase
deferToThread: avg 1694.422696 us, sync: avg 0.435658 us, 3889.34x increase
deferToThread: avg 1694.554698 us, sync: avg 0.431748 us, 3924.87x increase
deferToThread: avg 1694.615262 us, sync: avg 0.430913 us, 3932.61x increase
deferToThread: avg 1694.747073 us, sync: avg 0.428314 us, 3956.79x increase
deferToThread: avg 1694.701245 us, sync: avg 0.427885 us, 3960.65x increase
deferToThread: avg 1695.052327 us, sync: avg 0.419444 us, 4041.19x increase
deferToThread: avg 1695.149564 us, sync: avg 0.432749 us, 3917.16x increase
deferToThread: avg 1695.340127 us, sync: avg 0.432845 us, 3916.74x increase
deferToThread: avg 1695.489345 us, sync: avg 0.433822 us, 3908.26x increase
deferToThread: avg 1695.886709 us, sync: avg 0.436207 us, 3887.81x increase
deferToThread: avg 1696.334289 us, sync: avg 0.438710 us, 3866.64x increase
deferToThread: avg 1696.634512 us, sync: avg 0.434704 us, 3902.96x increase

Our synchronous function hasn't gotten worse, because there are fewer threads at play (thanks, Twisted!), and the difference between the two has gone up by about 16x (3809.28 / 16 = 238.08 -- and we were seeing ~260x).

For fun, here are the PyPy results for the code that *doesn't* use inlineCallbacks:

$ pypy twisted_threadpool.py 1 deferred
deferToThread: avg 248.463297 us, sync: avg 0.979066 us, 253.78x increase
deferToThread: avg 89.544964 us, sync: avg 0.260448 us, 343.81x increase
deferToThread: avg 63.347292 us, sync: avg 0.204158 us, 310.29x increase
deferToThread: avg 63.825631 us, sync: avg 0.205946 us, 309.91x increase
deferToThread: avg 55.110717 us, sync: avg 0.193667 us, 284.56x increase
deferToThread: avg 51.918244 us, sync: avg 0.212955 us, 243.80x increase
...
deferToThread: avg 62.353158 us, sync: avg 0.239372 us, 260.49x increase
deferToThread: avg 40.433884 us, sync: avg 0.148630 us, 272.04x increase
deferToThread: avg 65.630174 us, sync: avg 0.165820 us, 395.79x increase
deferToThread: avg 85.674763 us, sync: avg 0.259972 us, 329.55x increase
deferToThread: avg 65.085721 us, sync: avg 0.198150 us, 328.47x increase
deferToThread: avg 44.353342 us, sync: avg 0.153947 us, 288.11x increase
...

Looks like PyPy is doing a pretty good job! (I won't show it here, but there's a comment in that gist that contains links to vmprof PyPy runs for both the inlineCallbacks and inlineCallbacks-free versions. Looks like inlineCallbacks *does* get optimized pretty well...)

Things to consider:

1) Use your own thread pool -- see the gist I wrote for how to do that.
2) The behavior you were observing was due to increasing the amount of work but not the workers to do it, and it occurs with any thread pool implementation.
3) PyPy can be really fast!

Hope this helps. Let me know if anything's unclear!

Best,
Mark
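As referenced above, here is a minimal sketch (not the linked gist) of giving performance-sensitive code its own pool and dispatching to it with deferToThreadPool; the pool size and the queryElasticsearch name are illustrative.

    # Minimal sketch of a dedicated thread pool used via deferToThreadPool.
    from twisted.internet import reactor
    from twisted.internet.threads import deferToThreadPool
    from twisted.python.threadpool import ThreadPool

    # A pool that runs *only* our performance-sensitive calls, so it doesn't
    # compete with DNS lookups and other users of the reactor's default pool.
    pool = ThreadPool(minthreads=1, maxthreads=16, name="es-pool")
    pool.start()
    # Tear the pool's worker threads down when the reactor shuts down.
    reactor.addSystemEventTrigger("during", "shutdown", pool.stop)

    def queryElasticsearch(query):
        # Placeholder for the blocking client call (illustrative).
        return {"query": query}

    def handle(query):
        # Returns a Deferred that fires with queryElasticsearch's result.
        return deferToThreadPool(reactor, pool, queryElasticsearch, query)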

On Jun 3, 2016, at 19:10, Gelin Yan <dynamicgl@gmail.com> wrote:
Hi Mark
You mentioned deferToThreadPool. I am curious whether the standard thread pool in multiprocessing.dummy works with Twisted.
deferToThreadPool uses a Twisted threadpool interface, not a stdlib one. -glyph
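For completeness: if you do want results from a stdlib pool (multiprocessing.dummy.Pool is a thread pool with the Pool API) delivered to Twisted code, one hedged sketch is to fire a Deferred from the pool's completion callback via reactor.callFromThread. The helper name defer_to_stdlib_pool is made up for illustration; error handling is omitted.

    # Sketch: bridging a stdlib thread pool result back into Twisted.
    from multiprocessing.dummy import Pool
    from twisted.internet import defer, reactor

    pool = Pool(8)  # 8 worker threads (illustrative size)

    def defer_to_stdlib_pool(func, *args):
        d = defer.Deferred()
        def on_result(result):
            # The pool invokes this callback on one of its own threads, so hop
            # back to the reactor thread before firing the Deferred.
            reactor.callFromThread(d.callback, result)
        pool.apply_async(func, args, callback=on_result)
        return d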

Participants (4):
- Gelin Yan
- Glyph
- Mark Williams
- Nagy, Attila