[Twisted-Python] Twisted tips for designing highly concurrent twisted REST API

Hello folks,

I recently stumbled upon Twisted and was wondering if it could suit my needs. On one hand I want to use Python, but on the other hand there are all these scalability concerns with the language, so I thought I would pick the brains of the community.

A Flask-based app would look something like this:

    similar_types = ['foo', 'bar', 'baz']

    def long_computation(rec_type):
        # some long computation
        return recs

    @app.route('/fetch_similar_users/<user_id>')
    def fetch_similar_users(user_id):
        r = json.loads(requests.get('url/to/fetch/%s' % user_id).text)
        recs = {}
        for stype in similar_types:
            recs[stype] = long_computation(stype)
        return recs

I tried to "twistify" it, but it failed:

    @defer.inlineCallbacks
    def long_computation(rec_type):
        # some long computation
        defer.returnValue(recs)

    @defer.inlineCallbacks
    def fetch_data(user_id):
        r = yield json.loads(requests.get('url/to/fetch/%s' % user_id).text)
        defer.returnValue(r)

    @defer.inlineCallbacks
    def fetch_recs(user_id):
        data = yield fetch_data(user_id)
        recs = {}
        for stype in similar_types:
            d = threads.deferToThread(fetch_data, stype)
            rec = yield d
            recs[stype] = rec
        defer.returnValue(recs)

I wrapped all of the above in a Twisted render_GET method, then ran a load test with the Locust framework (https://docs.locust.io/en/latest/what-is-locust.html). It choked: the response time kept increasing as the test progressed.

Can you please point me to the right place to look? Why exactly does the response time increase over time? I am guessing things are still running in a "blocking" fashion, but I thought the code above would run them asynchronously.

Thanks

Sorry, I had a typo in the Twisted program:

    @defer.inlineCallbacks
    def long_computation(rec_type, data):
        # some long computation
        defer.returnValue(recs)

    @defer.inlineCallbacks
    def fetch_data(user_id):
        r = yield json.loads(requests.get('url/to/fetch/%s' % user_id).text)
        defer.returnValue(r)

    @defer.inlineCallbacks
    def fetch_recs(user_id):
        data = yield fetch_data(user_id)
        recs = {}
        for stype in similar_types:
            d = threads.deferToThread(long_computation, stype, data)  # typo was here
            rec = yield d
            recs[stype] = rec
        defer.returnValue(recs)

On Tue, Jun 25, 2019 at 11:48 PM Waqar Khan <wk80333@gmail.com> wrote:

Hi,

There are likely a few things wrong here.

1. You are using requests.get() to make an HTTP request. This is blocking. You might consider using Twisted's Agent <https://twistedmatrix.com/documents/current/api/twisted.web.client.Agent.htm...> API instead (or treq <https://github.com/twisted/treq>, which puts a requests-like API atop Agent).

2. As you add load, your long computations will be queued. deferToThread <https://twistedmatrix.com/documents/current/api/twisted.internet.threads.htm...> dispatches long_computation to the reactor's default thread pool <https://twistedmatrix.com/documents/current/api/twisted.internet.interfaces....>. This pool has a maximum size and will queue work once it has spun up that many threads. Rather than using deferToThread (which we should really deprecate, as it doesn't accept a reactor parameter...), I'd recommend instantiating your own ThreadPool <https://twistedmatrix.com/documents/current/api/twisted.python.threadpool.Th...> and using deferToThreadPool <https://twistedmatrix.com/documents/current/api/twisted.internet.threads.htm...>. The reactor's own thread pool is really for DNS resolution. You risk deadlocks in a system that [...]

3. The specifics of what long_computation does are also important. If it doesn't release the GIL you won't get real parallelism (this is a Python thing, not a Twisted thing). See this recent thread on the topic <https://twistedmatrix.com/pipermail/twisted-python/2019-June/032371.html>.

Though the mechanisms differ, any of the above would cause the response time to increase as you add load.

Good luck,
Tom

On Tue, Jun 25, 2019, at 11:51 PM, Waqar Khan wrote:

On Tuesday, 9 July 2019 22:04:11 BST Tom Most wrote: ...snip...
> The reactor's own thread pool is really for DNS resolution.
Is that still true in the default case? We use the Twisted code that talks to DNS servers, as the threaded resolver adds too much latency.
We pass out the computational work to other processes over unix-domain-sockets to avoid the GIL issues.
Barry

Klein and Crossbar.io seem relevant as well https://crossbario.com/blog/Going-Asynchronous-from-Flask-to-Twisted-Klein/ On Thu, Jul 11, 2019 at 1:46 AM Scott, Barry <barry.scott@forcepoint.com> wrote:

On 11.07.19 at 23:34, Sean DiZazzo wrote:
> Klein and Crossbar.io seem relevant as well
> https://crossbario.com/blog/Going-Asynchronous-from-Flask-to-Twisted-Klein/

Yeah, Klein is neat! FWIW, this might also be of interest, as it lets you scale up Twisted Web (and hence also Klein) across multiple cores (on Linux):

https://github.com/crossbario/crossbar-examples/tree/master/benchmark/web

Combining SO_REUSEPORT with Klein results in a concurrent, async (threadless) server parallelized via processes.

-- Tobias Oberstein, Crossbar.io GmbH, https://crossbar.io
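The SO_REUSEPORT idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code from the linked benchmark: the helper names, the loopback address and the backlog are made up here, and SO_REUSEPORT is only available on platforms that define it (Linux among them).

```python
import socket


def reuseport_listener(host="127.0.0.1", port=0, backlog=128):
    """Return a non-blocking TCP listener with SO_REUSEPORT set."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option has to be set before bind(), and every process that
    # wants to share the port has to set it as well.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(backlog)
    sock.setblocking(False)
    return sock


def demo():
    # Two listeners on the same port; in practice each one would live in
    # its own worker process and the kernel would spread connections.
    first = reuseport_listener()
    port = first.getsockname()[1]
    second = reuseport_listener(port=port)
    try:
        return first.getsockname()[1] == second.getsockname()[1]
    finally:
        first.close()
        second.close()
```

In each worker process, a socket like this could then be handed to Twisted with reactor.adoptStreamPort(sock.fileno(), socket.AF_INET, factory), where factory would be, for example, a twisted.web.server.Site wrapping the Klein resource. That hand-off is left out of the block so the sketch stays runnable without a web application.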

On Thu, Jul 11, 2019, at 1:46 AM, Scott, Barry wrote:
As far as I know, yes. The higher-level APIs use getaddrinfo() at least.

https://twistedmatrix.com/documents/current/api/twisted.internet._resolver.G...
https://github.com/twisted/twisted/blob/c0776850e756adfcdc179a7fd9e4c8f5cbc4...

TCP6ClientEndpoint also invokes getaddrinfo() directly.

twisted.names is certainly more performant, but it's missing some system integration features that make it unsuitable as a default:

* No support for the domain or search resolv.conf directives
* No NSS lookups (e.g., systemd integration)

This is all on Linux; YMMV on other platforms.

---Tom

Hi,

Thank you all for your kind responses. I am now trying to use the treq library:

    import treq

    @defer.inlineCallbacks
    def long_computation(rec_type, data):
        # some long computation
        defer.returnValue(recs)

    @defer.inlineCallbacks
    def fetch_data(user_id):
        r = yield treq.get('url/to/fetch/%s' % user_id)
        text = yield r.text()
        defer.returnValue(text)

    @defer.inlineCallbacks
    def fetch_recs(user_id):
        data = yield fetch_data(user_id)
        recs = {}
        for stype in similar_types:
            d = threads.deferToThread(long_computation, stype, data)

I do believe the call is now happening asynchronously, so yay. But I feel like I have a misconception about how yield works. In

    data = yield fetch_data(user_id)

I was hoping data would be the actual data, but it is a Deferred. Which makes sense. And then this Deferred is passed on instead of the actual data.

My questions are:

1) What is the difference between data = yield fetch_data(user_id) and data = fetch_data(user_id) (without yield)? How does Twisted handle the two?
2) How do I actually send the data to long_computation rather than a Deferred?

I appreciate all the help. Thanks

On Sat, Jul 13, 2019 at 1:57 AM Tom Most <twm@freecog.net> wrote:

participants (5)
* Scott, Barry
* Sean DiZazzo
* Tobias Oberstein
* Tom Most
* Waqar Khan