[Python-ideas] An alternate approach to async IO

Trent Nelson trent at snakebite.org
Wed Nov 28 23:40:34 CET 2012


On Wed, Nov 28, 2012 at 01:18:48PM -0800, Guido van Rossum wrote:
>    On Wed, Nov 28, 2012 at 1:02 PM, Trent Nelson <trent at snakebite.org> wrote:
> 
>      On Wed, Nov 28, 2012 at 12:49:51PM -0800, Guido van Rossum wrote:
> > > On Wed, Nov 28, 2012 at 12:32 PM, Trent Nelson <trent at snakebite.org> wrote:
> > > Right, so, I'm arguing that with my approach, because the background
> > > IO thread stuff is as optimal as it can be -- more IO events would
> > > be available per event loop iteration, and the latency between the
> > > event occurring versus when the event loop picks it up would be
> > > reduced.  The theory being that this will result in higher
> > > throughput and lower latency in practice.
> > >
> > > Also, from a previous e-mail, this:
> > >
> > >     with aio.open('1GB-file-on-a-fast-SSD.raw', 'rb') as f:
> > >         data = f.read()
> > >
> > > Or even just:
> > >
> > >     with aio.open('/dev/zero', 'rb') as f:
> > >         data = f.read(1024 * 1024 * 1024)
> > >
> > > Would basically complete as fast as is physically possible to read
> > > the bytes off the device.  If you've got 16+ cores, then you'll have
> > > 16 cores able to service IO interrupts in parallel.  So, the overall
> > > time to suck in a chunk of data will be vastly reduced.
> > >
> > > There's no way to get this sort of performance without taking
> > > my approach.
> >
> > So there's something I fundamentally don't understand. Why do those
> > calls, made synchronously in today's CPython, not already run as fast
> > as you can get the bytes off the device? I assume it's just a transfer
> > from kernel memory to user memory. So what is the advantage of using
> > aio over
> >
> >   with open(<file>, 'rb') as f:
> >       data = f.read()
> 
>          Ah, right.  That's where the OVERLAPPED aspect comes into play.
>          (I don't think any OS other than Windows and AIX provides
>           an overlapped IO facility?)
> 
>          The difference being, instead of having one thread writing to a 1GB
>          buffer, 4KB at a time, you have 16 threads writing to an overlapped
>          1GB buffer, 4KB at a time.
> 
>          (Assuming you have 16+ cores, and IO interrupts are coming in whilst
>           existing threads are still servicing previous completions.)
>              Trent.
> 
> Aha. So these are kernel threads?

    Sort-of-but-not-really.  From Vista onwards, you don't even work
    with threads directly; you just provide a callback, and Windows
    does all sorts of thread pool magic behind the scenes to allow
    overlapped IO.
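
    (A loose user-space analogy of that callback shape, using the
     stdlib's concurrent.futures rather than the Win32 threadpool API
     itself: you hand over work plus a callback, and a pool thread you
     never manage directly runs the callback on completion.  Purely
     illustrative, reusing the filename from earlier in the thread.)

        from concurrent.futures import ThreadPoolExecutor

        pool = ThreadPoolExecutor(max_workers=16)

        def read_file(path):
            with open(path, 'rb') as f:
                return f.read()

        def on_complete(future):
            # Runs on whichever pool thread the executor picks.
            print('read %d bytes' % len(future.result()))

        f = pool.submit(read_file, '1GB-file-on-a-fast-SSD.raw')
        f.add_done_callback(on_complete)
        pool.shutdown(wait=True)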

> Is the bandwidth of the I/O channel really higher than one CPU can
> copy bytes across a user/kernel boundary?

    Ah, good question!  Sometimes yes, sometimes no.  Depends on the
    hardware.  If you're reading from a single IO source like a file
    on a disk, it would have to be one hell of a fast disk and one
    super slow CPU before that would happen.

    However, consider this:

        aio.readfile('1GB-raw.1', buf1)
        aio.readfile('1GB-raw.2', buf2)
        aio.readfile('1GB-raw.3', buf3)
        ...

        with aio.events() as events:
            for event in events:
                if event.type == EventType.FileReadComplete:
                    aio.writefile(event.fname + '.bak', event.buf)

                elif event.type == EventType.FileWriteComplete:
                    log.debug('backed up ' + event.fname)

                elif event.type == EventType.FileWriteFailed:
                    log.error('failed to back up ' + event.fname)

    aio.readfile() and aio.writefile() return instantly.  With sufficient
    files being handled in parallel, the ability to have 16+ threads
    handle incoming requests instantly would be very useful.
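
    (For comparison, here's a rough approximation of that backup loop
     with today's blocking IO and the stdlib's concurrent.futures: not
     the proposed aio API, just the same submit-now, react-on-completion
     shape.  Filenames are illustrative.)

        import concurrent.futures as cf
        import shutil

        def back_up(fname):
            shutil.copyfile(fname, fname + '.bak')
            return fname

        files = ['1GB-raw.1', '1GB-raw.2', '1GB-raw.3']

        with cf.ThreadPoolExecutor(max_workers=16) as pool:
            futures = [pool.submit(back_up, f) for f in files]
            for fut in cf.as_completed(futures):
                try:
                    print('backed up ' + fut.result())
                except OSError as e:
                    print('failed to back up: %s' % e)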

    A second beneficial example would be a socket server with
    65k active connections.  New interrupts will continually be pouring
    in whilst you're still in the middle of copying data from a previous
    interrupt.

    Using my approach, Windows would be free to use as many threads as
    you have cores to service all these incoming requests concurrently.

    Because the threads are so simple and don't touch any CPython
    stuff, their cache footprint will be very small, which is ideal.
    All they do is copy bytes and then perform a quick interlocked
    list push, so they'll run extremely quickly, often within their
    first quantum, which means they're ready to service another
    request that much sooner.
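
    (A toy simulation of that split.  collections.deque stands in for
     the interlocked list, since deque.append() and popleft() are
     thread-safe, and the worker threads do nothing but "copy bytes,
     push, done".  In the real design the workers would be C-level and
     never touch CPython objects; this only shows the shape.)

        import threading
        from collections import deque

        completions = deque()    # stand-in for the interlocked list

        def io_worker(fd, nbytes):
            data = b'\x00' * nbytes            # stand-in for the byte copy
            completions.append((fd, data))     # the "interlocked push"

        threads = [threading.Thread(target=io_worker, args=(i, 4096))
                   for i in range(16)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # The single event-loop thread drains completions at its leisure.
        while completions:
            fd, data = completions.popleft()
            print('fd %d: %d bytes ready' % (fd, len(data)))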

    An important detail probably worth noting at this point: Windows
    won't spawn more threads than there are cores*.  So, if you've got
    all 16 threads tied up contending for the GIL and messing around
    with PyList_Append() etc., you're going to kill your performance;
    it'll take a lot longer to process new requests because the threads
    take so much longer to do their work.

    Compare that with the ultimate performance killer: a single
    thread that periodically calls GetQueuedCompletionStatus when
    it's ready to process some IO.  You can see how strange it would
    seem to take that approach; you get all the complexity of IOCP
    and overlapped IO with absolutely none of the benefits.
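
    (For the record, that pattern looks roughly like this via ctypes.
     Windows-only, and it assumes `port` is an existing completion port
     handle with overlapped handles already associated to it; a sketch
     of the polling loop, not a working server.)

        import ctypes
        from ctypes import wintypes

        kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

        def drain_one_at_a_time(port):
            nbytes = wintypes.DWORD()
            key = ctypes.c_void_p()
            lpov = ctypes.c_void_p()
            while True:
                ok = kernel32.GetQueuedCompletionStatus(
                    port,
                    ctypes.byref(nbytes),
                    ctypes.byref(key),
                    ctypes.byref(lpov),
                    1000)          # block for up to a second per call
                if not ok and not lpov.value:
                    continue       # timed out; nothing was dequeued
                # ... dispatch this single completion, then loop ...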


        Trent.


