[Async-sig] async/sync library reusage

Cory Benfield cory at lukasa.co.uk
Fri Jun 9 12:55:35 EDT 2017


> On 9 Jun 2017, at 17:28, Guido van Rossum <guido at python.org> wrote:
> 
> At least one of us is still confused. The one-event-loop-per-thread model is supported in asyncio without passing the loop around explicitly. The get_event_loop() implementation stores all its state in thread-locals instance, so it returns the thread's event loop. (Because this is an "advanced" model, you have to explicitly create the event loop with new_event_loop() and make it the default loop for the thread with set_event_loop().)

Aha, ok, so the confused one is me. I did not know this. =) That definitely works a lot better. It admittedly works less well if someone is doing their own custom event loop stuff, but that’s probably an acceptable limitation up until the time that Python 2 goes quietly into the night.
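[For readers following along, here is a minimal sketch of the one-loop-per-thread pattern Guido describes — the `worker`/`task` names are illustrative, not anything from asyncio itself, and the check uses the newer get_running_loop() spelling:]

```python
import asyncio
import threading

async def task():
    # Inside the coroutine, get_event_loop() returns this thread's loop.
    assert asyncio.get_event_loop() is asyncio.get_running_loop()
    await asyncio.sleep(0)

def worker():
    # Each thread creates its own loop and registers it as the thread's
    # default, so library code can call get_event_loop() without the
    # loop being passed around explicitly.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        loop.run_until_complete(task())
    finally:
        loop.close()

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```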

> All in all, I'm a bit curious why you would need to use asyncio at all when you've got a thread per request anyway.

Yeah, so this is a bit of a diversion from the original topic of this thread but I think it’s an idea worth discussing in this space. I want to reframe the question a bit if you don’t mind, so shout if you think I’m not responding to quite what you were asking. In my understanding, the question you’re implicitly asking is this:

"If you have a thread-safe library today (that is, one that allows users to do threaded I/O with appropriate resource pooling and management), why move to a model built on asyncio?”

The answers differ from library to library and use case to use case, but for HTTP libraries like urllib3, here are ours.

The first is that, even for HTTP/1.1, it turns out you need to write something that amounts to a partial event loop to handle the protocol properly. A good HTTP client needs to watch for responses while it is uploading body data, because if a response arrives mid-upload, the upload should be terminated immediately. The same machinery is needed to handle things like Expect: 100-continue sensibly, and to spot other intermediate responses and connection teardowns without throwing exceptions.

Today urllib3 does not do this, and that has caused us pain, so our v2 branch includes a backport of the Python 3 selectors module and a hand-written, partially-complete event loop that handles only the specific cases we need. This is an extra thing for us to debug and maintain, and ultimately it would be easier to delegate the whole job to event loops written by others who promise to maintain them and make them efficient.
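[As a rough illustration of the kind of partial event loop this means — this is an invented sketch using the selectors module, not urllib3's actual code, and `send_body` is a made-up name:]

```python
import selectors
import socket

def send_body(sock, chunks):
    """Upload body chunks, but stop immediately if the server responds
    early (e.g. an error response rejecting the upload, or an interim
    response such as 100 Continue). Returns the early-response bytes,
    or None if the whole body was sent without the server speaking."""
    sel = selectors.DefaultSelector()
    sel.register(sock, selectors.EVENT_READ | selectors.EVENT_WRITE)
    chunks = iter(chunks)
    pending = b""
    try:
        while True:
            if not pending:
                nxt = next(chunks, None)
                if nxt is None:
                    return None  # body fully sent, no early response
                pending = nxt
            for _key, events in sel.select():
                if events & selectors.EVENT_READ:
                    # The server spoke first: terminate the upload and
                    # hand these bytes to the response parser.
                    return sock.recv(65536)
                if events & selectors.EVENT_WRITE:
                    sent = sock.send(pending)
                    pending = pending[sent:]
    finally:
        sel.close()
```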

The second answer is that I believe good asyncio support in libraries is a vital part of the future of this language, and "good" asyncio support IMO does as little as possible to block the main event loop. Running all of the complex protocol parsing and state manipulation of the Requests stack on a background thread is not cheap, and involves a lot of GIL swapping. We have seen several bug reports about using Requests with largish numbers of threads, indicating that our big stack of Python code really does cause GIL contention when used heavily. In general, having to defer to a thread to run *Python* code in asyncio is IMO a nasty anti-pattern that should be avoided where possible. It is much less bad to defer to a thread that then blocks on a syscall (e.g. to get an "async" getaddrinfo), but doing so to run a big stack of Python code is vastly less pleasant for the main event loop.
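[To make the "less bad" case concrete, here is a sketch of deferring a blocking syscall to a worker thread — the function name is invented, though asyncio's own loop.getaddrinfo() does essentially this internally:]

```python
import asyncio
import socket

async def async_getaddrinfo(host, port):
    # Deferring a blocking *syscall* to a worker thread is the benign
    # case: the thread spends its time blocked in the OS, not fighting
    # over the GIL. Shipping a large pure-Python parsing stack to a
    # thread, by contrast, just relocates the GIL contention.
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(None, socket.getaddrinfo, host, port)
```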

For this reason, we’d ideally treat asyncio as the first-class citizen and retrofit on the threaded support, rather than the other way around. This goes doubly so when you consider the other reasons for wanting to use asyncio.

The third answer is that HTTP/2 makes all of this much harder. HTTP/2 is a *highly* concurrent protocol. Connections send many control frames back and forth that are invisible to the user working at the semantic HTTP level but that nonetheless need relatively low-latency turnaround (e.g. PING frames). In the traditional synchronous HTTP model, urllib3 gets access to the socket to do work only when the user calls into our code. If the user goes a "long" time without calling into urllib3, we take a long time to process any data waiting on the connection. In the best case this causes latency spikes as we work through everything that queued up in the socket buffer. In the worst case, it causes us to lose connections we should have been able to keep, because we failed to respond to a PING frame in a timely manner.

My experience is that purely synchronous libraries simply cannot provide a positive user experience with HTTP/2. HTTP/2 flat-out *requires* either an event loop or a dedicated background thread, and in practice, in your dedicated background thread you'd just end up writing an event loop anyway (see answer 1 again). For this reason, HTTP/2 support in Python basically has to either use an event loop or spawn a dedicated C thread that does the I/O without holding the GIL (as that thread will be woken regularly to handle I/O events).
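[A toy sketch of the event-loop shape this pushes you toward: a background reader task owns the socket, so control frames get answered promptly regardless of when user code calls in. The "frames" here are fake newline-delimited stand-ins, not real HTTP/2 framing, and the class name is invented:]

```python
import asyncio

class Http2ishConnection:
    """Toy sketch only. The point is the shape: a background task owns
    the socket, so control frames like PING get a low-latency answer
    even while the user's code is busy elsewhere."""

    def __init__(self, reader, writer):
        self._reader = reader
        self._writer = writer
        self._responses = asyncio.Queue()
        self._task = asyncio.ensure_future(self._read_loop())

    async def _read_loop(self):
        while True:
            frame = await self._reader.readline()
            if not frame:
                break  # connection closed
            if frame.startswith(b"PING "):
                # Answered immediately, independent of user activity.
                self._writer.write(b"PONG " + frame[5:])
                await self._writer.drain()
            else:
                await self._responses.put(frame)

    async def get_response(self):
        # User-facing calls just consume what the reader task queued.
        return await self._responses.get()
```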

Hopefully this (admittedly horrifyingly long) response helps illuminate why we’re interested in asyncio support. It should be noted that if we find ourselves unable to get it in the short term we may simply resort to offering an “async” API that involves us doing the rough equivalent of running in a thread-pool executor, but I won’t be thrilled about it. ;)

Cory 