Well, okay, please go benchmark something and don't let my ignorance of async I/O on Windows discourage you. (I suppose you've actually written code like this in C or C++ so you know it all works?)<br><br>It still looks to me like you'll have a hard time keeping 16 cores busy if the premise is that you're doing *some* processing in Python (as opposed to the rather unlikely use case of backing up 1GB files), but it also looks to me that, if your approach works, it could be spliced into (e.g.) a Twisted reactor easily without changing Twisted's high-level interfaces in any way.<br>


<br>Do you have an implementation for the "interlocked list" that you mention?<br><br><div class="gmail_quote">On Wed, Nov 28, 2012 at 2:40 PM, Trent Nelson <span dir="ltr"><<a href="mailto:trent@snakebite.org" target="_blank">trent@snakebite.org</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Wed, Nov 28, 2012 at 01:18:48PM -0800, Guido van Rossum wrote:<br>

>    On Wed, Nov 28, 2012 at 1:02 PM, Trent Nelson <<a href="mailto:trent@snakebite.org">trent@snakebite.org</a>> wrote:<br>

><br>

>      On Wed, Nov 28, 2012 at 12:49:51PM -0800, Guido van Rossum wrote:<br>

> > On Wed, Nov 28, 2012 at 12:32 PM, Trent Nelson <<a href="mailto:trent@snakebite.org">trent@snakebite.org</a>><br>

>      wrote:<br>

> > > Right, so, I'm arguing that with my approach, because the background<br>

> > > IO thread stuff is as optimal as it can be -- more IO events would<br>

> > > be available per event loop iteration, and the latency between the<br>

> > > event occurring versus when the event loop picks it up would be<br>

> > > reduced.  The theory being that that will result in higher through-<br>

> > > put and lower latency in practice.<br>

> > ><br>

> > > Also, from a previous e-mail, this:<br>

> > ><br>

> > >     with aio.open('1GB-file-on-a-fast-SSD.raw', 'rb') as f:<br>

> > >         data = f.read()<br>

> > ><br>

> > > Or even just:<br>

> > ><br>

> > >     with aio.open('/dev/zero', 'rb') as f:<br>

> > >         data = f.read(1024 * 1024 * 1024)<br>

> > ><br>

> > > Would basically complete as fast as it is physically possible to read<br>

> > > the bytes off the device.  If you've got 16+ cores, then you'll have<br>

> > > 16 cores able to service IO interrupts in parallel.  So, the overall<br>

> > > time to suck in a chunk of data will be vastly reduced.<br>

> > ><br>

> > > There's no other way to get this sort of performance without taking<br>

> > > my approach.<br>

> ><br>

> > So there's something I fundamentally don't understand. Why do those<br>

> > calls, made synchronously in today's CPython, not already run as fast<br>

> > as you can get the bytes off the device? I assume it's just a transfer<br>

> > from kernel memory to user memory. So what is the advantage of using<br>

> > aio over<br>

> ><br>

> >   with open(<file>, 'rb') as f:<br>

> >       data = f.read()<br>

><br>

>          Ah, right.  That's where the OVERLAPPED aspect comes into play.<br>

>          (Other than Windows and AIX, I don't think any other OS provides<br>

>           an overlapped IO facility?)<br>

><br>

>          The difference being, instead of having one thread writing to a 1GB<br>

>          buffer, 4KB at a time, you have 16 threads writing to an overlapped<br>

>          1GB buffer, 4KB at a time.<br>

><br>

>          (Assuming you have 16+ cores, and IO interrupts are coming in whilst<br>

>           existing threads are still servicing previous completions.)<br>

>              Trent.<br>

><br>

> Aha. So these are kernel threads?<br>

<br>

</div></div>    Sort-of-but-not-really.  In Vista onwards, you don't even work with<br>

    threads directly, you just provide a callback, and Windows does all<br>

    sorts of thread pool magic behind the scenes to allow overlapped IO.<br>

<div class="im"><br>

> Is the bandwidth of the I/O channel really higher than one CPU can<br>

> copy bytes across a user/kernel boundary?<br>

<br>

</div>    Ah, good question!  Sometimes yes, sometimes no.  Depends on the<br>

    hardware.  If you're reading from a single IO source like a file<br>

    on a disk, it would have to be one hell of a fast disk and one<br>

    super slow CPU before that would happen.<br>

<br>

    However, consider this:<br>

<br>

        aio.readfile('1GB-raw.1', buf1)<br>

        aio.readfile('1GB-raw.2', buf2)<br>

        aio.readfile('1GB-raw.3', buf3)<br>

        ...<br>

<div class="im"><br>

        with aio.events() as events:<br>

            for event in events:<br>

</div>                if event.type == EventType.FileReadComplete:<br>

                    aio.writefile(event.fname + '.bak', event.buf)<br>

<br>

                if event.type == EventType.FileWriteComplete:<br>

                    log.debug('backed up ' + event.fname)<br>

<br>

                if event.type == EventType.FileWriteFailed:<br>

                    log.error('failed to back up ' + event.fname)<br>

<br>

    aio.readfile() and writefile() return instantly.  With sufficient<br>

    files being handled in parallel, the ability to have 16+ threads<br>

    handle incoming requests instantly would be very useful.<br>
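<br>
    For readers without an aio module to hand, the shape of that loop can be<br>
    approximated in plain Python. This is only a sketch under assumptions:<br>
    backup_all, read_job, write_job and the event constants are hypothetical<br>
    names, and an ordinary thread pool plus queue.Queue stands in for the<br>
    overlapped IO and completion machinery described above, so it shows the<br>
    control flow rather than the performance.<br>

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Hypothetical event kinds, mirroring the EventType names in the example.
READ_OK, READ_FAILED, WRITE_OK, WRITE_FAILED = range(4)


def backup_all(fnames, workers=4):
    """Copy each file to <name>.bak, driven by a single event loop."""
    fnames = list(fnames)
    events = queue.Queue()  # completions posted by worker threads

    def read_job(fname):
        try:
            events.put((READ_OK, fname, Path(fname).read_bytes()))
        except OSError:
            events.put((READ_FAILED, fname, None))

    def write_job(fname, buf):
        try:
            Path(fname).write_bytes(buf)
            events.put((WRITE_OK, fname, None))
        except OSError:
            events.put((WRITE_FAILED, fname, None))

    backed_up, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for fname in fnames:
            pool.submit(read_job, fname)  # returns immediately
        pending = len(fnames)
        while pending:
            kind, fname, buf = events.get()  # blocks until a completion
            if kind == READ_OK:
                pool.submit(write_job, fname + ".bak", buf)
            elif kind == WRITE_OK:
                backed_up.append(fname)
                pending -= 1
            else:  # READ_FAILED or WRITE_FAILED
                failed.append(fname)
                pending -= 1
    return backed_up, failed
```

    Calling backup_all(['1GB-raw.1', '1GB-raw.2']) returns the list of .bak<br>
    paths written and the list of names that failed. Only the loop touches<br>
    the result lists; the workers do nothing but IO and a queue put.<br>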

<br>

    A second beneficial example would be if you're a socket server with<br>

    65k active connections.  New interrupts will continually be pouring<br>

    in whilst you're still in the middle of copying data from a previous<br>

    interrupt.<br>

<br>

    Using my approach, Windows would be free to use as many threads as<br>

    you have cores to service all these incoming requests concurrently.<br>

<br>

    Because the threads are so simple and don't touch any CPython stuff,<br>

    their cache footprint will be very small, which is ideal.  All they<br>

    are doing is copying bytes then a quick interlocked list push, so<br>

    they'll run extremely quickly, often within their first quantum,<br>

    which means they're ready to service another request that much<br>

    quicker.<br>
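<br>
    The "copy bytes, then push onto a shared list" pattern can be sketched<br>
    in Python as below. The assumptions are loud ones: CompletionList and<br>
    run are hypothetical names, collections.deque (whose append and popleft<br>
    are thread-safe) stands in for a native interlocked list such as the<br>
    Windows SList functions InterlockedPushEntrySList and<br>
    InterlockedFlushSList, and Python threads here still share the GIL, so<br>
    this illustrates the many-pushers, one-flusher structure and not the<br>
    parallel speed-up.<br>

```python
import threading
from collections import deque


class CompletionList:
    """Toy stand-in for an interlocked list: many pushers, one flusher."""

    def __init__(self):
        self._items = deque()

    def push(self, item):
        # deque.append is thread-safe, so pushers need no explicit lock.
        self._items.append(item)

    def flush(self):
        """Remove and return everything pushed so far, oldest first."""
        drained = []
        while True:
            try:
                drained.append(self._items.popleft())
            except IndexError:  # list is empty for now
                return drained


def run(n_threads=16, per_thread=1000):
    completions = CompletionList()

    def io_thread(tid):
        # Pretend each iteration is one finished IO: record it and move on.
        for i in range(per_thread):
            completions.push((tid, i))

    threads = [threading.Thread(target=io_thread, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()

    seen = []
    # The single "event loop" thread drains whatever has accumulated.
    while any(t.is_alive() for t in threads):
        seen.extend(completions.flush())
    for t in threads:
        t.join()
    seen.extend(completions.flush())  # pick up any stragglers
    return seen
```

    Each flush hands the loop a batch of completions in one go, which is<br>
    the "more IO events per event loop iteration" effect argued for earlier.<br>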

<br>

    An important detail probably worth noting at this point: Windows<br>

    won't spawn more threads than there are cores*.  So, if you've got<br>

    all 16 threads tied up contending for the GIL and messing around<br>

    with PyList_Append() etc, you're going to kill your performance;<br>

    it'll take a lot longer to process new requests because the threads<br>

    take so much longer to do their work.<br>

<br>

    And compare that with the ultimate performance killer of a single<br>

    thread that periodically calls GetQueuedCompletionStatus when it's<br>

    ready to process some IO, and you can see how strange it would seem<br>

    to take that approach.  You're getting all the complexity of IOCP<br>

    and overlapped IO with absolutely none of the benefits.<br>

<span class="HOEnZb"><font color="#888888"><br>

<br>

        Trent.<br>

</font></span></blockquote></div><br><br clear="all"><br>-- <br>--Guido van Rossum (<a href="http://python.org/~guido">python.org/~guido</a>)<br>