[Python-ideas] An alternate approach to async IO
Trent Nelson
trent at snakebite.org
Tue Nov 27 18:49:33 CET 2012
I was hoping to have some code to go along with this idea, but the
WSAPoll stuff distracted me, so, I don't have any concrete examples
just yet.
As part of (slowly) catching up on all the async IO discussions, I
reviewed both Twisted's current iocpreactor implementation, as well
as Richard Oudkerk's IOCP/tulip work:
http://mail.python.org/pipermail/python-ideas/2012-November/017684.html
Both implementations caught me a little off guard. The Twisted
iocpreactor appears to drive a 'one shot' iteration of
GetQueuedCompletionStatus per event loop iteration -- sort of
treating it just like select/poll (in that you call it once to get
a list of things to do, do them, then call it again).
Richard's work sort of works the same way... the proactor drives
the completion port polling via GetQueuedCompletionStatus.
From what I know about overlapped IO and IOCP, this seems like a
really odd way to do things: by *not* using a thread pool, whose
threads' process_some_io_that_just_completed() methods are
automatically called* by the underlying OS, you're getting all of
the additional complexity of overlapped IO and IOCP without any of
the concurrency benefits.
    [*]: On AIX, Solaris and Windows XP/2003, you'd manually spawn
         a bunch of threads and have them call
         GetQueuedCompletionStatus or port_get(). Those calls
         return when IO is available on the given completion port.
On Windows 7/2008R2, you can leverage the new thread pool
APIs. You don't need to create a single thread yourself,
just spin up a pool and associate the IO completion with
it -- Windows will automatically manage the underlying
threads optimally.
         Windows 8/2012 introduces a new API, Registered I/O (RIO),
         which leverages pre-registered buffers and completion
         ports to minimize the overhead of copying data from kernel
         space to user space. It's pretty nifty. You use RIO in
         concert with thread pools and IOCP.
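To make that polling model concrete, here's a minimal, Windows-only
sketch of threads parked on a completion port. It uses ctypes purely
for illustration (the real implementation would live in C, and on
Windows 7+ you'd lean on the native thread pool APIs instead); the
stop-packet convention is something I made up for the example:

    import ctypes
    import threading
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.CreateIoCompletionPort.restype = wintypes.HANDLE
    kernel32.CreateIoCompletionPort.argtypes = [
        wintypes.HANDLE, wintypes.HANDLE, ctypes.c_void_p,
        wintypes.DWORD]
    kernel32.GetQueuedCompletionStatus.argtypes = [
        wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD),
        ctypes.POINTER(ctypes.c_void_p),
        ctypes.POINTER(ctypes.c_void_p), wintypes.DWORD]
    kernel32.PostQueuedCompletionStatus.argtypes = [
        wintypes.HANDLE, wintypes.DWORD, ctypes.c_void_p,
        ctypes.c_void_p]

    INVALID_HANDLE_VALUE = wintypes.HANDLE(-1).value
    INFINITE = 0xFFFFFFFF
    STOP_KEY = 0xDEAD  # arbitrary sentinel used to shut workers down

    # A standalone port; real code would associate each socket with
    # it via another CreateIoCompletionPort() call.
    port = kernel32.CreateIoCompletionPort(
        INVALID_HANDLE_VALUE, None, None, 0)

    def worker():
        nbytes = wintypes.DWORD()
        key = ctypes.c_void_p()          # ULONG_PTR-sized
        ov = ctypes.c_void_p()
        while True:
            # Blocks in the kernel until a completion packet arrives;
            # ctypes drops the GIL around the call, so the workers
            # genuinely wait (and wake) concurrently.
            kernel32.GetQueuedCompletionStatus(
                port, ctypes.byref(nbytes), ctypes.byref(key),
                ctypes.byref(ov), INFINITE)
            if key.value == STOP_KEY:
                break
            print('completed: %d bytes (key=%r)'
                  % (nbytes.value, key.value))

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()

    # Simulate a completed read, then shut the pool down.
    kernel32.PostQueuedCompletionStatus(port, 4096, 1, None)
    for _ in threads:
        kernel32.PostQueuedCompletionStatus(port, 0, STOP_KEY, None)
    for t in threads:
        t.join()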
Here's the "idea" I had, with zero working code to back it up:
what if we had a bunch of threads in the background whose sole
purpose it was to handle AIO? On Windows/AIX, they would poll
GetQueuedCompletionStatus, on Solaris, get_event().
They're literally raw pthreads and have absolutely nothing to
do with Python's threading.Thread() stuff. They exist solely
in C and can't be interfaced to directly from Python code.
...which means they're free to run outside the GIL, and thus
multiple cores could be leveraged concurrently. (Only for
processing completed I/O, but hey, it's better than nothing.)
The threads would process the completion port events via C code
and allocate any necessary char * buffers on demand. Upon
completing their processing -- say, reading 4096 bytes from a
socket -- they push the processed event and data onto an
interlocked list, then go back to GetQueuedCompletionStatus/port_get().
You eventually need to process these events from Python code.
Here's where I think this approach is neat: we could expose a new
API that's semantically pretty close to how poll() works now.
Except instead of polling a bunch of non-blocking file descriptors
to see what's ready for reading/writing, you're simply getting a
list of events that completed in the background.
You process the events, then presumably, your event loop starts
again: grab available events, process them. The key here is that
nothing blocks, but not because of non-blocking sockets.
Nothing blocks because the reads() have already taken place and
the writes() return immediately. So your event loop now becomes
this tight little chunk of code that spins as quickly as it can
on processing events. (This would lend itself very well to the
Twisted notion of deferring blocking actions (like db stuff) via
callFromThread(), but I'll ignore that detail for now.)
So, presuming your Python code looks something like this:

    for event in aio.events():
        if event.type == EventType.DataReceived:
            ...
        elif event.type == ...
Let's talk about how aio.events() would work. I mentioned that the
background threads, once they've completed processing their event,
push the event details and data onto an interlocked list.
I got this idea from the interlocked list methods available in
Windows since XP. They facilitate synchronized access to a singly
linked list without the need for explicit mutexes. Some sample
methods:
InterlockedPopEntrySList
InterlockedPushEntrySList
InterlockedFlushSList
More info: http://msdn.microsoft.com/en-us/library/windows/desktop/ms684121(v=vs.85).aspx
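As a rough pure-Python analogue of those semantics (the real APIs
are lock-free, using compare-and-swap on the list head; a mutex
stands in for that below, and the class is purely illustrative):

    import threading

    class InterlockedListAnalogue:
        # Models the Win32 SList semantics; the real thing avoids
        # the mutex via atomic compare-and-swap on the head.

        def __init__(self):
            self._head = None            # nodes are (payload, next)
            self._lock = threading.Lock()

        def push(self, payload):         # InterlockedPushEntrySList
            with self._lock:
                self._head = (payload, self._head)

        def pop(self):                   # InterlockedPopEntrySList
            with self._lock:
                if self._head is None:
                    return None
                payload, self._head = self._head
                return payload

        def flush(self):                 # InterlockedFlushSList
            with self._lock:
                head, self._head = self._head, None
            # The detached chain is private now; walk it unlocked.
            items = []
            while head is not None:
                payload, head = head
                items.append(payload)
            items.reverse()              # SLists are LIFO; oldest first
            return items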
So, the last thing a background thread does before going back to
poll GetQueuedCompletionStatus/port_get() is an interlocked push
onto a global list. What would it push? It depends on the event:
at the least, an event identifier; at the most, an event identifier
plus a pointer to the char * buffer allocated by the thread, perhaps.
Now, when aio.events() is called, we have some C code that does
an interlocked flush -- this basically pops all the entries off
the list in a single atomic operation.
It then loops over the events and creates the necessary CPython
objects for use by the subsequent Python code. So, for data
received, it would call PyBytes_FromStringAndSize(...) with the
buffer indicated in the event, then free() that chunk of memory.
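In Python terms -- reusing the analogue class from above, plus a
hypothetical Event record -- aio.events() would boil down to
something like this:

    from collections import namedtuple

    # Hypothetical event record; the real fields are an open question.
    Event = namedtuple('Event', 'type fileno data')

    completed = InterlockedListAnalogue()  # background threads push here

    def events():
        # One interlocked flush per call: everything that finished
        # since the last loop iteration, oldest first.
        for etype, fileno, data in completed.flush():
            # The C version would build the bytes object from the
            # char * buffer here (PyBytes_FromStringAndSize) and
            # then free() the raw buffer.
            yield Event(etype, fileno, data)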
(That was just the first idea that came to my mind; there are
probably tons of better ways to do it in practice. The point
is that however it's done, the end result is a GC-tracked object
with no memory leaked from the background thread's buffer
allocation.
Perhaps there could be a separate interlocked list of shared
buffers that the background threads pop from when they need a
buffer, and that the Python aio.events() code pushes back to when
it is done converting a buffer into a CPython object.
...and then, down the track, subsequent optimizations could allow
the CPython object to inherit the buffer for its lifetime,
removing the need to constantly copy data from background thread
buffers to CPython buffers.)
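That shared pool could itself be a second interlocked list. A
sketch of both halves, again with the analogue class standing in
for the C implementation (the 4096-byte size is arbitrary):

    buffer_pool = InterlockedListAnalogue()

    def acquire_buffer():
        # Background-thread side (C in reality): recycle a buffer
        # if one is available, otherwise allocate a fresh one.
        buf = buffer_pool.pop()          # InterlockedPopEntrySList
        return buf if buf is not None else bytearray(4096)

    def release_buffer(buf):
        # Called from aio.events() once the data has been copied
        # into a CPython object and the raw buffer is free again.
        buffer_pool.push(buf)            # InterlockedPushEntrySList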
And... I think that's the crux of it, really. The key points are
actual asynchronous IO, carried out by threads that aren't GIL
constrained and are thus able to run concurrently -- coupled with
a pretty simple Python interface.
Now, follow-on ideas from that core premise: the longer we can stay
in C land, the more performant the solution. Putting that little
tidbit aside, I want to mention Twisted again, because I think their
protocol approach is spot on:
    class Echo(protocol.Protocol):
        def dataReceived(self, data):
            self.transport.write(data)

        def lineReceived(self, line):
            self.transport.write(line)
That's a completely nonsensical example, as you wouldn't have both
a lineReceived and a dataReceived method, but it illustrates the
point of writing classes that are driven by events.
As for maximizing how long we can stay in C versus Python, consider
serving an HTTP request. You accept a connection, wait for the
headers, then send headers+data back. From a Python programmer's
perspective, you don't really care whether data has been read
unless you've received the entire, well-formed set of headers from
the client.
For an SMTP server, it's more chatty: read a line, send a line
back, read a few more lines, send some more lines back.
In both of these cases, you could apply some post-processing of
the data in C, perhaps as a simple regex match to start with. It
would be pretty easy to regex-match an incoming HELO/GET line,
queue the well-formed ones for processing by aio.events(), and
automatically send pre-registered errors back for those that
don't match.
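A toy version of that pre-filter, with made-up patterns and a
made-up canned error (the real matching would happen in C against
the raw buffer, long before Python sees anything):

    import re

    # Illustrative patterns only; real protocols need more care.
    HTTP_REQUEST = re.compile(br'^(?:GET|HEAD|POST) \S+ HTTP/1\.[01]\r\n')
    SMTP_GREETING = re.compile(br'^(?:HELO|EHLO) \S+\r\n')

    def prefilter(line, transport):
        # Queue well-formed lines; reject everything else right away.
        if HTTP_REQUEST.match(line) or SMTP_GREETING.match(line):
            return True                  # becomes an aio.events() event
        transport.write(b'400 Bad Request\r\n')  # pre-registered error
        return False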
Things like accept filters on BSD work the same way: they don't
return to the calling code until there's a legitimate event. It
greatly simplifies the eventual Python implementation, too. Rather
than write your own aio.events()-based event loop, you'd take the
Twisted approach and register your protocol handlers with a global
"reactor" that is responsible for processing the raw aio.events()
and then invoking the relevant methods on your class instances.
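Dispatch inside such a reactor might look like the following,
where aio, EventType and the per-connection protocols mapping are
all hypothetical names carried over from the earlier sketches:

    def run(protocols):
        # The tight loop from earlier, now owned by the reactor
        # rather than by user code.
        while True:
            for event in aio.events():
                proto = protocols[event.fileno]
                if event.type == EventType.DataReceived:
                    proto.dataReceived(event.data)
                elif event.type == EventType.ConnectionLost:
                    proto.connectionLost()
                    del protocols[event.fileno]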
So, let's assume that's all implemented and working in 3.4. The
drawback of this approach is that even though we've allowed for
some actual threaded concurrency via the background IO threads,
the main Python code that loops over aio.events() is still limited
to executing on a single core -- albeit in a very tight loop that
never blocks and would probably be able to process an insane
number of events per second while pegging a single core at 100%.
So, that's 3.4. Perhaps in 3.5 we could add automatic support for
multiprocessing once the number of events per poll reaches a
certain threshold. The event loop would automatically spread the
processing of events out via multiprocessing, facilitating
multiple-core usage both via background threads *and* Python code.
(And we could probably do some optimizations such that a given
background IO thread always queues up events for the same
multiprocessing instance -- which would yield even more benefits
if we had the fancy "buffer inheritance" stuff that removes the
need to continually copy data from the background IO buffers to
the foreground CPython code.)
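As a rough sketch of that threshold idea (the cutoff value and the
use of multiprocessing.Pool are both assumptions, and the handler
plus the events would need to be picklable for this to work at all):

    import multiprocessing

    THRESHOLD = 1024                 # made-up cutoff
    # One worker per core by default; on Windows this would need
    # the usual "if __name__ == '__main__'" guard around it.
    pool = multiprocessing.Pool()

    def process(events, handle):
        if len(events) < THRESHOLD:
            for event in events:     # cheap case: stay in-process
                handle(event)
        else:
            pool.map(handle, events) # fan out across cores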
As an added bonus, by the time 3.5 rolls around, perhaps the Linux
and FreeBSD camps will have seen how performant IOCP and Solaris
event ports can be and added similar support (the Solaris event
port API wouldn't be that hard to port elsewhere for an
experienced kernel hacker -- it's quite elegant, and, hey, the
source code is available). (We could mimic it in the meantime with
background threads that call epoll/kqueue, I guess.)
Thoughts? Example code or GTFO? ;-)
Trent.