[Python-ideas] An alternate approach to async IO
Trent Nelson
trent at snakebite.org
Tue Nov 27 18:49:33 CET 2012
I was hoping to have some code to go along with this idea, but the
WSAPoll stuff distracted me, so, I don't have any concrete examples
just yet.
As part of (slowly) catching up on all the async IO discussions, I
reviewed both Twisted's current iocpreactor implementation, as well
as Richard Oudkerk's IOCP/tulip work:
http://mail.python.org/pipermail/python-ideas/2012-November/017684.html
Both implementations caught me a little off guard. The Twisted
iocpreactor appears to drive a 'one shot' iteration of
GetQueuedCompletionStatus per event loop iteration -- sort of
treating it just like select/poll (in that you call it once to get
a list of things to do, do them, then call it again).
Richard's work sort of works the same way... the proactor drives
the completion port polling via GetQueuedCompletionStatus.
From what I know about overlapped IO and IOCP, this seems like a
really odd way to do things: by *not* using a thread pool, whose
threads' process_some_io_that_just_completed() methods are
automatically called* by the underlying OS, you're getting all of
the additional complexity of overlapped IO and IOCP without any of
the concurrency benefits.
    [*]: On AIX, Solaris and Windows XP/2003, you'd manually spawn
         a bunch of threads and have them call
         GetQueuedCompletionStatus or port_get(). Those calls
         return when IO is available on the given completion port.
On Windows 7/2008R2, you can leverage the new thread pool
APIs. You don't need to create a single thread yourself,
just spin up a pool and associate the IO completion with
it -- Windows will automatically manage the underlying
threads optimally.
         Windows 8/2012 introduces a new API, Registered I/O (RIO),
         which leverages pre-registered buffers and completion
         ports to minimize the overhead of copying data from kernel
         space to user space. It's pretty nifty. You use RIO in
         concert with thread pools and IOCP.
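To make that polling model concrete, here's a minimal, Windows-only
sketch of threads parked on a completion port. It uses ctypes purely
for illustration (the real implementation would live in C, and on
Windows 7+ you'd lean on the native thread pool APIs instead); the
stop-packet convention is something I made up for the example:

    import ctypes
    import threading
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.CreateIoCompletionPort.restype = wintypes.HANDLE
    kernel32.CreateIoCompletionPort.argtypes = [
        wintypes.HANDLE, wintypes.HANDLE, ctypes.c_void_p,
        wintypes.DWORD]
    kernel32.GetQueuedCompletionStatus.argtypes = [
        wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD),
        ctypes.POINTER(ctypes.c_void_p),
        ctypes.POINTER(ctypes.c_void_p), wintypes.DWORD]
    kernel32.PostQueuedCompletionStatus.argtypes = [
        wintypes.HANDLE, wintypes.DWORD, ctypes.c_void_p,
        ctypes.c_void_p]

    INVALID_HANDLE_VALUE = wintypes.HANDLE(-1).value
    INFINITE = 0xFFFFFFFF
    STOP_KEY = 0xDEAD  # arbitrary sentinel used to shut workers down

    # A standalone port; real code would associate each socket with
    # it via another CreateIoCompletionPort() call.
    port = kernel32.CreateIoCompletionPort(
        INVALID_HANDLE_VALUE, None, None, 0)

    def worker():
        nbytes = wintypes.DWORD()
        key = ctypes.c_void_p()          # ULONG_PTR-sized
        ov = ctypes.c_void_p()
        while True:
            # Blocks in the kernel until a completion packet arrives;
            # ctypes drops the GIL around the call, so the workers
            # genuinely wait (and wake) concurrently.
            kernel32.GetQueuedCompletionStatus(
                port, ctypes.byref(nbytes), ctypes.byref(key),
                ctypes.byref(ov), INFINITE)
            if key.value == STOP_KEY:
                break
            print('completed: %d bytes (key=%r)'
                  % (nbytes.value, key.value))

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()

    # Simulate a completed read, then shut the pool down.
    kernel32.PostQueuedCompletionStatus(port, 4096, 1, None)
    for _ in threads:
        kernel32.PostQueuedCompletionStatus(port, 0, STOP_KEY, None)
    for t in threads:
        t.join()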
Here's the "idea" I had, with zero working code to back it up:
what if we had a bunch of threads in the background whose sole
purpose it was to handle AIO? On Windows/AIX, they would poll
GetQueuedCompletionStatus, on Solaris, get_event().
They're literally raw pthreads and have absolutely nothing to
do with Python's threading.Thread() stuff. They exist solely
in C and can't be interfaced to directly from Python code.
...which means they're free to run outside the GIL, and thus
multiple cores could be leveraged concurrently. (Only for
processing completed I/O, but hey, it's better than nothing.)
The threads would process the completion port events via C code
and allocate any necessary char * buffers on demand. Upon
completing their processing -- say, reading 4096 bytes from a
socket -- they push the processed event and data onto an
interlocked list, then go back to GetQueuedCompletionStatus/port_get().
You eventually need to process these events from Python code.
Here's where I think this approach is neat: we could expose a new
API that's semantically pretty close to how poll() works now.
Except instead of polling a bunch of non-blocking file descriptors
to see what's ready for reading/writing, you're simply getting a
list of events that completed in the background.
You process the events, then presumably, your event loop starts
again: grab available events, process them. The key here is that
nothing blocks, but not because of non-blocking sockets.
Nothing blocks because the reads() have already taken place and
the writes() return immediately. So your event loop now becomes
this tight little chunk of code that spins as quickly as it can
on processing events. (This would lend itself very well to the
Twisted notion of deferring blocking actions (like db stuff) via
callFromThread(), but I'll ignore that detail for now.)
So, presuming your Python code looks something like this:

    for event in aio.events():
        if event.type == EventType.DataReceived:
            ...
        elif event.type == ...
Let's talk about how aio.events() would work. I mentioned that the
background threads, once they've completed processing their event,
push the event details and data onto an interlocked list.
I got this idea from the interlocked list methods available in
Windows since XP. They facilitate synchronized access to a singly
linked list without the need for explicit mutexes. Some sample
methods:
InterlockedPopEntrySList
InterlockedPushEntrySList
InterlockedFlushSList
More info: http://msdn.microsoft.com/en-us/library/windows/desktop/ms684121(v=vs.85).aspx
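As a rough pure-Python analogue of those semantics (the real APIs
are lock-free, using compare-and-swap on the list head; a mutex
stands in for that below, and the class is purely illustrative):

    import threading

    class InterlockedListAnalogue:
        # Models the Win32 SList semantics; the real thing avoids
        # the mutex via atomic compare-and-swap on the head.

        def __init__(self):
            self._head = None            # nodes are (payload, next)
            self._lock = threading.Lock()

        def push(self, payload):         # InterlockedPushEntrySList
            with self._lock:
                self._head = (payload, self._head)

        def pop(self):                   # InterlockedPopEntrySList
            with self._lock:
                if self._head is None:
                    return None
                payload, self._head = self._head
                return payload

        def flush(self):                 # InterlockedFlushSList
            with self._lock:
                head, self._head = self._head, None
            # The detached chain is private now; walk it unlocked.
            items = []
            while head is not None:
                payload, head = head
                items.append(payload)
            items.reverse()              # SLists are LIFO; oldest first
            return items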
So, the last thing a background thread does before going back to
poll GetQueuedCompletionStatus/port_get() is an interlocked push
onto a global list. What would it push? It depends on the event:
at the least, an event identifier; at the most, an event identifier
plus a pointer to the char * buffer allocated by the thread, perhaps.
Now, when aio.events() is called, we have some C code that does
an interlocked flush -- this basically pops all the entries off
the list in a single atomic operation.
It then loops over the events and creates the necessary CPython
objects for use by the subsequent Python code. So, for data
received, it would call PyBytes_FromStringAndSize(...) with the
buffer indicated in the event, then free() that chunk of memory.
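In Python terms -- reusing the analogue class from above, plus a
hypothetical Event record -- aio.events() would boil down to
something like this:

    from collections import namedtuple

    # Hypothetical event record; the real fields are an open question.
    Event = namedtuple('Event', 'type fileno data')

    completed = InterlockedListAnalogue()  # background threads push here

    def events():
        # One interlocked flush per call: everything that finished
        # since the last loop iteration, oldest first.
        for etype, fileno, data in completed.flush():
            # The C version would build the bytes object from the
            # char * buffer here (PyBytes_FromStringAndSize) and
            # then free() the raw buffer.
            yield Event(etype, fileno, data)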
(That was just the first idea that came to my mind; there are
probably tons of better ways to do it in practice. The point
is that however it's done, the end result is a GC-tracked object
with no memory leaked from the background thread's buffer
allocation.
Perhaps there could be a separate interlocked list of shared
buffers that the background threads pop from when they need a
buffer, and that the Python aio.events() code pushes back to when
it is done converting a buffer into a CPython object.
...and then, down the track, subsequent optimizations could allow
the CPython object to inherit the buffer for its lifetime,
removing the need to constantly copy data from background thread
buffers to CPython buffers.)
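That shared pool could itself be a second interlocked list. A
sketch of both halves, again with the analogue class standing in
for the C implementation (the 4096-byte size is arbitrary):

    buffer_pool = InterlockedListAnalogue()

    def acquire_buffer():
        # Background-thread side (C in reality): recycle a buffer
        # if one is available, otherwise allocate a fresh one.
        buf = buffer_pool.pop()          # InterlockedPopEntrySList
        return buf if buf is not None else bytearray(4096)

    def release_buffer(buf):
        # Called from aio.events() once the data has been copied
        # into a CPython object and the raw buffer is free again.
        buffer_pool.push(buf)            # InterlockedPushEntrySList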
And... I think that's the crux of it, really. The key points are
actual asynchronous IO, carried out by threads that aren't GIL
constrained and are thus able to run concurrently -- coupled with
a pretty simple Python interface.
Now, follow-on ideas from that core premise: the longer we can stay
in C land, the more performant the solution. Putting that little
tidbit aside, I want to mention Twisted again, because I think their
protocol approach is spot on:
    class Echo(protocol.Protocol):
        def dataReceived(self, data):
            self.transport.write(data)

        def lineReceived(self, line):
            self.transport.write(line)
That's a completely nonsensical example, as you wouldn't have both
a lineReceived and a dataReceived method, but it illustrates the
point of writing classes that are driven by events.
As for maximizing how long we can stay in C versus Python, consider
serving an HTTP request. You accept a connection, wait for the
headers, then send headers+data back. From a Python programmer's
perspective, you don't really care whether data has been read
unless you've received the entire, well-formed set of headers from
the client.
For an SMTP server, it's more chatty: read a line, send a line
back, read a few more lines, send some more lines back.
In both of these cases, you could apply some post-processing of
the data in C, perhaps as a simple regex match to start with. It
would be pretty easy to regex-match an incoming HELO/GET line,
queue the well-formed ones for processing by aio.events(), and
automatically send pre-registered errors back for those that
don't match.
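A toy version of that pre-filter, with made-up patterns and a
made-up canned error (the real matching would happen in C against
the raw buffer, long before Python sees anything):

    import re

    # Illustrative patterns only; real protocols need more care.
    HTTP_REQUEST = re.compile(br'^(?:GET|HEAD|POST) \S+ HTTP/1\.[01]\r\n')
    SMTP_GREETING = re.compile(br'^(?:HELO|EHLO) \S+\r\n')

    def prefilter(line, transport):
        # Queue well-formed lines; reject everything else right away.
        if HTTP_REQUEST.match(line) or SMTP_GREETING.match(line):
            return True                  # becomes an aio.events() event
        transport.write(b'400 Bad Request\r\n')  # pre-registered error
        return False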
Things like accept filters on BSD work the same way: they don't
return to the calling code until there's a legitimate event. It
greatly simplifies the eventual Python implementation, too. Rather
than write your own aio.events()-based event loop, you'd take the
Twisted approach and register your protocol handlers with a global
"reactor" that is responsible for processing the raw aio.events()
and then invoking the relevant methods on your class instances.
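Dispatch inside such a reactor might look like the following,
where aio, EventType and the per-connection protocols mapping are
all hypothetical names carried over from the earlier sketches:

    def run(protocols):
        # The tight loop from earlier, now owned by the reactor
        # rather than by user code.
        while True:
            for event in aio.events():
                proto = protocols[event.fileno]
                if event.type == EventType.DataReceived:
                    proto.dataReceived(event.data)
                elif event.type == EventType.ConnectionLost:
                    proto.connectionLost()
                    del protocols[event.fileno]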
So, let's assume that's all implemented and working in 3.4. The
drawback of this approach is that even though we've allowed for
some actual threaded concurrency via the background IO threads,
the main Python code that loops over aio.events() is still limited
to executing on a single core -- albeit in a very tight loop that
never blocks and would probably be able to process an insane
number of events per second while pegging a single core at 100%.
So, that's 3.4. Perhaps in 3.5 we could add automatic support for
multiprocessing once the number of events per poll reaches a
certain threshold. The event loop would automatically spread the
processing of events out via multiprocessing, facilitating
multiple-core usage both via background threads *and* Python code.
(And we could probably do some optimizations such that a given
background IO thread always queues up events for the same
multiprocessing instance -- which would yield even more benefits
if we had the fancy "buffer inheritance" stuff that removes the
need to continually copy data from the background IO buffers to
the foreground CPython code.)
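As a rough sketch of that threshold idea (the cutoff value and the
use of multiprocessing.Pool are both assumptions, and the handler
plus the events would need to be picklable for this to work at all):

    import multiprocessing

    THRESHOLD = 1024                 # made-up cutoff
    # One worker per core by default; on Windows this would need
    # the usual "if __name__ == '__main__'" guard around it.
    pool = multiprocessing.Pool()

    def process(events, handle):
        if len(events) < THRESHOLD:
            for event in events:     # cheap case: stay in-process
                handle(event)
        else:
            pool.map(handle, events) # fan out across cores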
As an added bonus, by the time 3.5 rolls around, perhaps the Linux
and FreeBSD camps will have seen how performant IOCP and Solaris
event ports can be and added similar support (the Solaris event
port API wouldn't be that hard to port elsewhere for an
experienced kernel hacker -- it's quite elegant, and, hey, the
source code is available). (We could mimic it in the meantime with
background threads that call epoll/kqueue, I guess.)
Thoughts? Example code or GTFO? ;-)
Trent.