An alternate approach to async IO
I was hoping to have some code to go along with this idea, but the WSAPoll stuff distracted me, so, I don't have any concrete examples just yet.

As part of (slowly) catching up on all the async IO discussions, I reviewed both Twisted's current iocpreactor implementation, as well as Richard Oudkerk's IOCP/tulip work: http://mail.python.org/pipermail/python-ideas/2012-November/017684.html

Both implementations caught me a little off-guard. The Twisted iocpreactor appears to drive a 'one shot' iteration of GetQueuedCompletionStatus per event loop iteration -- sort of treating it just like select/poll (in that you call it once to get a list of things to do, do them, then call it again). Richard's work sort of works the same way... the proactor drives the completion port polling via GetQueuedCompletionStatus.

From what I know about overlapped IO and IOCP, this seems like a really odd way to do things: by *not* using a thread pool, whose threads' process_some_io_that_just_completed() methods are automatically called* by the underlying OS, you're getting all of the additional complexity of overlapped IO and IOCP without any of the concurrency benefits.

[*]: On AIX, Solaris and Windows XP/2003, you'd manually spawn a bunch of threads and have them call GetQueuedCompletionStatus or port_get(). Those methods return when IO is available on the given completion port. On Windows 7/2008R2, you can leverage the new thread pool APIs. You don't need to create a single thread yourself, just spin up a pool and associate the IO completion with it -- Windows will automatically manage the underlying threads optimally. Windows 8/2012 introduces a new API, Registered I/O, which leverages pre-registered buffers and completion ports to minimize overhead of copying data from kernel to user space. It's pretty nifty. You use RIO in concert with thread pools and IOCP.

Here's the "idea" I had, with zero working code to back it up: what if we had a bunch of threads in the background whose sole purpose it was to handle AIO? On Windows/AIX, they would poll GetQueuedCompletionStatus, on Solaris, get_event().

They're literally raw pthreads and have absolutely nothing to do with Python's threading.Thread() stuff. They exist solely in C and can't be interfaced to directly from Python code.

....which means they're free to run outside the GIL, and thus, multiple cores could be leveraged concurrently. (Only for processing completed I/O, but hey, it's better than nothing.)

The threads would process the completion port events via C code and allocate necessary char * buffers on demand. Upon completion of their processing, let's say, reading 4096 bytes from a socket, they push their processed event and data to an interlocked* list, then go back to GetQueuedCompletionStatus/get_event.

You eventually need to process these events from Python code. Here's where I think this approach is neat: we could expose a new API that's semantically pretty close to how poll() works now. Except instead of polling a bunch of non-blocking file descriptors to see what's ready for reading/writing, you're simply getting a list of events that completed in the background.

You process the events, then presumably, your event loop starts again: grab available events, process them. The key here is that nothing blocks, but not because of non-blocking sockets. Nothing blocks because the reads() have already taken place and the writes() return immediately.
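As a concrete (though entirely hypothetical) sketch of what one of those background threads might look like in C -- the AIO_OP/AIO_EVENT layouts and the g_port/g_events globals are invented, and error handling and shutdown logic are omitted:

    #include <windows.h>
    #include <malloc.h>

    typedef struct {
        OVERLAPPED  ol;         /* one per outstanding operation */
        char       *buf;        /* buffer the read was issued into */
    } AIO_OP;

    typedef struct {
        SLIST_ENTRY entry;      /* must be first and 16-byte aligned */
        ULONG_PTR   key;        /* completion key (e.g. the connection) */
        DWORD       nbytes;     /* bytes transferred */
        char       *buf;        /* data now owned by this event */
    } AIO_EVENT;

    static HANDLE       g_port;    /* the completion port */
    static SLIST_HEADER g_events;  /* InitializeSListHead() at startup */

    static DWORD WINAPI aio_worker(LPVOID unused)
    {
        for (;;) {
            DWORD nbytes;
            ULONG_PTR key;
            OVERLAPPED *ol;

            /* Block until the OS hands us a completed operation. */
            if (!GetQueuedCompletionStatus(g_port, &nbytes, &key, &ol,
                                           INFINITE))
                continue;

            AIO_OP *op = CONTAINING_RECORD(ol, AIO_OP, ol);

            /* Package the completion and hand it towards Python-land via
             * an interlocked push; the GIL is never involved. */
            AIO_EVENT *ev = _aligned_malloc(sizeof(AIO_EVENT),
                                            MEMORY_ALLOCATION_ALIGNMENT);
            ev->key = key;
            ev->nbytes = nbytes;
            ev->buf = op->buf;
            InterlockedPushEntrySList(&g_events, &ev->entry);
        }
        return 0;
    }

The only synchronization point is that single interlocked push; the worker never touches a Python object or the GIL.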
So your event loop now becomes this tight little chunk of code that spins as quickly as it can on processing events. (This would lend itself very well to the Twisted notion of deferring blocking actions (like db stuff) via callFromThread(), but I'll ignore that detail for now.)

So, presuming your Python code looks something like this:

    for event in aio.events():
        if event.type == EventType.DataReceived:
            ...
        elif event.type == ...

Let's talk about how aio.events() would work. I mentioned that the background threads, once they've completed processing their event, push the event details and data onto an interlocked list. I got this idea from the interlocked list methods available in Windows since XP. They facilitate synchronized access to a singly linked list without the need for explicit mutexes. Some sample methods:

    InterlockedPopEntrySList
    InterlockedPushEntrySList
    InterlockedFlushSList

More info: http://msdn.microsoft.com/en-us/library/windows/desktop/ms684121(v=vs.85).as...

So, the last thing a background thread does before going back to poll GetQueuedCompletionStatus/get_event is an interlocked push onto a global list. What would it push? Depends on the event; at the least, an event identifier, at the most, an event identifier and a pointer to the char * buffer allocated by the thread, perhaps?

Now, when aio.events() is called, we have some C code that does an interlocked flush -- this basically pops all the entries off the list in an interlocked, atomic fashion. It then loops over all the events and creates the necessary CPython objects that can then be used in the subsequent Python code. So, for data received, it would call PyBytes_FromStringAndSize(...) with the buffer indicated in the event, then free() that chunk of memory.

(That was just the first idea that came to my mind; there are probably tons of better ways to do it in practice. The point is that however it's done, the end result is a GC-tracked object with no memory leaks from the background thread buffer alloc.

Perhaps there could be a separate interlocked list of shared buffers that the background threads pop from when they need a buffer, and the Python aio.events() code pushes to when it is done converting the buffer into a CPython object.

....and then down the track, subsequent optimizations that allow the CPython object to inherit the buffer for its lifetime, removing the need to constantly copy data from background thread buffers to CPython buffers.)

And... I think that's the crux of it really. Key points are actual asynchronous IO, carried out by threads that aren't GIL constrained and thus, able to run concurrently -- coupled with a pretty simple Python interface.

Now, follow-on ideas from that core premise: the longer we can stay in C land, the more performant the solution. Putting that little tidbit aside, I want to mention Twisted again, because I think their protocol approach is spot on:

    class Echo(protocol.Protocol):
        def dataReceived(self, data):
            self.transport.write(data)

        def lineReceived(self, line):
            self.transport.write(line)

That's a completely nonsense example, as you wouldn't have both a lineReceived and dataReceived method, but it illustrates the point of writing classes that are driven by events.

As for maximizing how long we can stay in C versus Python, consider serving a HTTP request. You accept a connection, wait for headers, then send headers+data back.
From a Python programmer's perspective you don't really care if data has been read unless you've received the entire, well-formed set of headers from the client. For an SMTP server, it's more chatty: read a line, send a line back, read a few more lines, send some more lines back.

In both of these cases, you could apply some post-processing of data in C, perhaps as a simple regex match to start with. It would be pretty easy to regex match an incoming HELO/GET line, queue the well-formed ones for processing by aio.events(), and automatically send pre-registered errors back for those that don't match. Things like accept filters on BSD work like this; i.e. don't return back to the calling code until there's a legitimate event.

It greatly simplifies the eventual Python implementation, too. Rather than write your own aio.events()-based event loop, you'd take the Twisted approach and register your protocol handlers with a global "reactor" that is responsible for processing the raw aio.events() and then invoking the relevant method on your class instances.

So, let's assume that's all implemented and working in 3.4. The drawback of this approach is that even though we've allowed for some actual threaded concurrency via background IO threads, the main Python code that loops over aio.events() is still limited to executing on a single core. Albeit, in a very tight loop that never blocks and would probably be able to process an insane number of events per second when pegging a single core at 100%.

So, that's 3.4. Perhaps in 3.5 we could add automatic support for multiprocessing once the number of events per-poll reaches a certain threshold. The event loop automatically spreads out the processing of events via multiprocessing, facilitating multiple core usage both via background threads *and* Python code. (And we could probably do some optimizations such that the background IO thread always queues up events for the same multiprocessing instance -- which would yield even more benefits if we had fancy "buffer inheritance" stuff that removes the need to continually copy data from the background IO buffers to the foreground CPython code.)

As an added bonus, by the time 3.5 rolls around, perhaps the Linux and FreeBSD camps will have seen how performant IOCP/Solaris-events can be and added similar support (the Solaris event API wouldn't be that hard to port elsewhere for an experienced kernel hacker. It's quite elegant, and, hey, the source code is available). (We could mimic it in the meantime with background threads that call epoll/kqueue, I guess.)

Thoughts? Example code or GTFO? ;-)

Trent.
On 27/11/2012 5:49pm, Trent Nelson wrote:
Here's the "idea" I had, with zero working code to back it up: what if we had a bunch of threads in the background whose sole purpose it was to handle AIO? On Windows/AIX, they would poll GetQueuedCompletionStatus, on Solaris, get_event().
They're literally raw pthreads and have absolutely nothing to do with Python's threading.Thread() stuff. They exist solely in C and can't be interfaced to directly from Python code.
....which means they're free to run outside the GIL, and thus, multiple cores could be leveraged concurrently. (Only for processing completed I/O, but hey, it's better than nothing.)
The threads would process the completion port events via C code and allocate necessary char * buffers on demand. Upon completion of their processing, let's say, reading 4096 bytes from a socket, they push their processed event and data to an interlocked* list, then go back to GetQueuedCompletionStatus/get_event.
But you have to allocate the buffer *before* you initiate an overlapped read. And you may as well make that buffer a Python bytes object (which can be shrunk if it is too large). That leaves no "processing" that can usefully be done by a C level thread pool. Also, note that (at least on Windows) overlapped IO automatically makes use of a hidden thread pool to do the IO. -- Richard
On Tue, Nov 27, 2012 at 10:30:10AM -0800, Richard Oudkerk wrote:
On 27/11/2012 5:49pm, Trent Nelson wrote:
Here's the "idea" I had, with zero working code to back it up: what if we had a bunch of threads in the background whose sole purpose it was to handle AIO? On Windows/AIX, they would poll GetQueuedCompletionStatus, on Solaris, get_event().
They're literally raw pthreads and have absolutely nothing to do with Python's threading.Thread() stuff. They exist solely in C and can't be interfaced to directly from Python code.
....which means they're free to run outside the GIL, and thus, multiple cores could be leveraged concurrently. (Only for processing completed I/O, but hey, it's better than nothing.)
The threads would process the completion port events via C code and allocate necessary char * buffers on demand. Upon completion of their processing, let's say, reading 4096 bytes from a socket, they push their processed event and data to an interlocked* list, then go back to GetQueuedCompletionStatus/get_event.
But you have to allocate the buffer *before* you initiate an overlapped read. And you may as well make that buffer a Python bytes object (which can be shrunk if it is too large). That leaves no "processing" that can usefully be done by a C level thread pool.
I'm a little confused by that last sentence. The premise of my idea is being able to service AIO via simple GIL-independent threads that really just copy data from A to B. The simple fact that they don't have to acquire the GIL each time the completion port has an event seems like a big win, no? (So I'm not sure if you're saying that this wouldn't work because you may as well use Python bytes objects, and they can't be accessed willy-nilly from non-GIL threads... or if you're saying they can, but there's no benefit from a C-level thread copying data to/from buffers independent of the GIL.)
Also, note that (at least on Windows) overlapped IO automatically makes use of a hidden thread pool to do the IO.
I don't think that detail impacts my general idea though, right? Trent.
On 27/11/2012 7:59pm, Trent Nelson wrote:
But you have to allocate the buffer *before* you initiate an overlapped read. And you may as well make that buffer a Python bytes object (which can be shrunk if it is too large). That leaves no "processing" that can usefully be done by a C level thread pool.

I'm a little confused by that last sentence. The premise of my idea is being able to service AIO via simple GIL-independent threads that really just copy data from A to B. The simple fact that they don't have to acquire the GIL each time the completion port has an event seems like a big win, no?
(So I'm not sure if you're saying that this wouldn't work because you may as well use Python bytes objects, and they can't be accessed willy-nilly from non-GIL threads... or if you're saying they can, but there's no benefit from a C-level thread copying data to/from buffers independent of the GIL.)
I am saying that there is no copying necessary if you use a bytes object as your buffer. You can just use _PyBytes_Resize() afterwards to shrink it if necessary. -- Richard
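A sketch of that suggestion, assuming the GIL is held both when the bytes object is created and when it is shrunk (helper names are invented; error handling omitted):

    #include <Python.h>
    #include <winsock2.h>

    static PyObject *
    start_overlapped_recv(SOCKET sock, Py_ssize_t bufsize, OVERLAPPED *ol)
    {
        /* Allocate an uninitialised bytes object of the maximum size... */
        PyObject *buf = PyBytes_FromStringAndSize(NULL, bufsize);
        if (buf == NULL)
            return NULL;

        WSABUF wbuf;
        DWORD flags = 0;
        wbuf.buf = PyBytes_AS_STRING(buf);
        wbuf.len = (ULONG)bufsize;

        /* ...and hand its internal storage straight to the kernel. */
        WSARecv(sock, &wbuf, 1, NULL, &flags, ol, NULL);
        return buf;              /* resized when the completion arrives */
    }

    static void
    finish_overlapped_recv(PyObject **buf, DWORD nbytes)
    {
        /* Shrink in place to the number of bytes actually received. */
        if ((Py_ssize_t)nbytes < PyBytes_GET_SIZE(*buf))
            _PyBytes_Resize(buf, (Py_ssize_t)nbytes);
    }

With this shape there is no intermediate char * buffer at all: the kernel writes directly into the bytes object's storage.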
On Tue, Nov 27, 2012 at 12:13:18PM -0800, Richard Oudkerk wrote:
On 27/11/2012 7:59pm, Trent Nelson wrote:
But you have to allocate the buffer *before* you initiate an overlapped read. And you may as well make that buffer a Python bytes object (which can be shrunk if it is too large). That leaves no "processing" that can usefully be done by a C level thread pool.

I'm a little confused by that last sentence. The premise of my idea is being able to service AIO via simple GIL-independent threads that really just copy data from A to B. The simple fact that they don't have to acquire the GIL each time the completion port has an event seems like a big win, no?
(So I'm not sure if you're saying that this wouldn't work because you may as well use Python bytes objects, and they can't be accessed willy-nilly from non-GIL threads... or if you're saying they can, but there's no benefit from a C-level thread copying data to/from buffers independent of the GIL.)
I am saying that there is no copying necessary if you use a bytes object as your buffer. You can just use _PyBytes_Resize() afterwards to shrink it if necessary.
Got it. So what about the "no processing that can be usefully done by a C level thread" bit? I'm trying to discern whether or not you're highlighting a fundamental flaw in the theory/idea ;-) (That it's going to be more optimal to have background threads service IO without the need to acquire the GIL, basically.) Trent.
On Tue, 27 Nov 2012 15:19:46 -0500 Trent Nelson <trent@snakebite.org> wrote:
Got it. So what about the "no processing that can be usefully done by a C level thread" bit? I'm trying to discern whether or not you're highlighting a fundamental flaw in the theory/idea ;-)
(That it's going to be more optimal to have background threads service IO without the need to acquire the GIL, basically.)
You don't need to acquire the GIL if you use the Py_buffer API in the right way. You'll figure out the details :-) Regards Antoine.
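Presumably something along these lines is meant -- a sketch, with invented names, of exporting a writable buffer under the GIL and then filling it from a thread that never touches the interpreter:

    #include <Python.h>
    #include <string.h>

    typedef struct {
        Py_buffer view;     /* keeps the exporting object alive and pinned */
    } pending_read;

    static int
    prepare_buffer(PyObject *bytearray_obj, pending_read *pr)
    {
        /* GIL held: exporting the buffer pins the object, so it cannot be
         * resized or collected while the background thread writes to it. */
        return PyObject_GetBuffer(bytearray_obj, &pr->view, PyBUF_WRITABLE);
    }

    static void
    background_fill(pending_read *pr, const char *data, size_t n)
    {
        /* No GIL needed: a plain memory write into pinned storage. */
        if (n > (size_t)pr->view.len)
            n = (size_t)pr->view.len;
        memcpy(pr->view.buf, data, n);
    }

    static void
    finish_read(pending_read *pr)
    {
        /* GIL held again: release the view. */
        PyBuffer_Release(&pr->view);
    }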
Den 27. nov. 2012 kl. 21:54 skrev Antoine Pitrou <solipsis@pitrou.net>:
You don't need to acquire the GIL if you use the Py_buffer API in the right way. You'll figure out the details :-)
And if he doesn't, this is what Cython's memoryview syntax will help him to do :-)

As for the rest of his idea, I think it is fundamentally flawed in two ways:

First, there is no fundamental difference between a non-registered thread and a Python thread that has released the GIL. They are both native OS threads running freely without bothering the interpreter. So given that the GIL will be released in an extension module, a list of threading.Thread is much easier to set up than using POSIX or Windows threads from C.

Second, using a thread pool to monitor another thread pool (an IOCP on Windows) seems a bit confused. IOCPs can fire a callback function on completion. It is easier to let the IOCP use the simplified GIL API from the callback code to acquire the GIL, and notify the Python code with a callback. Or an even simpler solution is to use twisted (twistedmatrix.org) instead of reinventing the wheel.

Sturla
On Tue, Nov 27, 2012 at 01:25:55PM -0800, Sturla Molden wrote:
Den 27. nov. 2012 kl. 21:54 skrev Antoine Pitrou <solipsis@pitrou.net>:
You don't need to acquire the GIL if you use the Py_buffer API in the right way. You'll figure out the details :-)
And if he doesn't, this is what Cython's memoryview syntax will help him to do :-)
As for the rest of his idea, I think it is fundamentally flawed in two ways:
First, there is no fundamental difference between a non-registered thread and a Python thread that has released the GIL. They are both native OS threads running freely without bothering the interpreter. So given that the GIL will be released in an extension module, a list of threading.Thread is much easier to set up than using POSIX or Windows threads from C.
Second, using a thread pool to monitor another thread pool (an IOCP on Windows) seems a bit confused. IOCPs can fire a callback function on completion. It is easier to let the IOCP use the simplified GIL API from the callback code to acquire the GIL, and notify the Python code with a callback. Or an even simpler solution is to use twisted (twistedmatrix.org) instead of reinventing the wheel.
Hrm, neither of those points are really flaws with my idea, you're just suggesting two different ways of working with IOCP, both of which wouldn't scale as well as a GIL-independent approach. As for reinventing the wheel, I'm not. I'm offering an idea for achieving high performance async IO. In fact, Twisted would run like the dickens if there was an aio.events() API -- they wouldn't need any of the non-blocking magic behind the scenes like they do now to *simulate* an asynchronous, event-driven architecture. Trent.
Den 27. nov. 2012 kl. 22:48 skrev Trent Nelson <trent@snakebite.org>:
Hrm, neither of those points are really flaws with my idea, you're just suggesting two different ways of working with IOCP, both of which wouldn't scale as well as a GIL-independent approach.
CPython is not a GIL-independent approach. You can stack ten thread pools on top of each other, but sooner or later you must call back to Python. sturla
On Tue, Nov 27, 2012 at 02:12:12PM -0800, Sturla Molden wrote:
Den 27. nov. 2012 kl. 22:48 skrev Trent Nelson <trent@snakebite.org>:
Hrm, neither of those points are really flaws with my idea, you're just suggesting two different ways of working with IOCP, both of which wouldn't scale as well as a GIL-independent approach.
CPython is not a GIL-independent approach. You can stack ten thread pools on top of each other, but sooner or later you must call back to Python.
Right, but with things like interlocked lists, you can make that CPython|background_IO synchronization barrier much more performant than relying on GIL acquisition. I think we'll have to agree to disagree at this point; there's not much point arguing further until there's some code+benchmarks on the table. Trent.
Den 27. nov. 2012 kl. 23:36 skrev Trent Nelson <trent@snakebite.org>:
Right, but with things like interlocked lists, you can make that CPython|background_IO synchronization barrier much more performant than relying on GIL acquisition.
You always need the GIL to call back to Python. You don't need it for anything else. Sturla
On Tue, Nov 27, 2012 at 03:33:55PM -0800, Sturla Molden wrote:
Den 27. nov. 2012 kl. 23:36 skrev Trent Nelson <trent@snakebite.org>:
Right, but with things like interlocked lists, you can make that CPython|background_IO synchronization barrier much more performant than relying on GIL acquisition.
You always need the GIL to call back to Python. You don't need it for anything else.
Right, you *currently* need the GIL to call back to Python. I'm proposing an alternate approach that avoids the GIL. On Windows, it would be via interlocked lists. Trent.
Den 28. nov. 2012 kl. 00:41 skrev Trent Nelson <trent@snakebite.org>:
Right, you *currently* need the GIL to call back to Python.
You need the GIL to access the CPython interpreter. You did not suggest anything that will change that.
I'm proposing an alternate approach that avoids the GIL. On Windows, it would be via interlocked lists.
No, you just misunderstood how the GIL works. You seem to think the GIL serializes Python threads. But what it serializes is access to the CPython interpreter. When Python threads process data in C land they don't need the GIL and can run freely. Sturla
On Tue, Nov 27, 2012 at 3:33 PM, Sturla Molden <sturla@molden.no> wrote:
Den 27. nov. 2012 kl. 23:36 skrev Trent Nelson <trent@snakebite.org>:
Right, but with things like interlocked lists, you can make that CPython|background_IO synchronization barrier much more performant than relying on GIL acquisition.
You always need the GIL to call back to Python. You don't need it for anything else.
You also need it for any use of an object, even INCREF, unless you know no other thread yet knows about it. -- --Guido van Rossum (python.org/~guido)
Den 28. nov. 2012 kl. 00:50 skrev Guido van Rossum <guido@python.org>:
On Tue, Nov 27, 2012 at 3:33 PM, Sturla Molden <sturla@molden.no> wrote:
Den 27. nov. 2012 kl. 23:36 skrev Trent Nelson <trent@snakebite.org>:
Right, but with things like interlocked lists, you can make that CPython|background_IO synchronization barrier much more performant than relying on GIL acquisition.
You always need the GIL to call back to Python. You don't need it for anything else.
You also need it for any use of an object, even INCREF, unless you know no other thread yet knows about it.
Yes. But you don't need it to call Windows API functions like GetQueuedCompletionStatus, which is what Trent implied. (Actually, calling GetQueuedCompletionStatus while holding the GIL would probably cause a deadlock.) Sturla
On Tue, Nov 27, 2012 at 03:50:34PM -0800, Guido van Rossum wrote:
On Tue, Nov 27, 2012 at 3:33 PM, Sturla Molden <sturla@molden.no> wrote:
Den 27. nov. 2012 kl. 23:36 skrev Trent Nelson <trent@snakebite.org>:
Right, but with things like interlocked lists, you can make that CPython|background_IO synchronization barrier much more performant than relying on GIL acquisition.
You always need the GIL to call back to Python. You don't need it for anything else.
You also need it for any use of an object, even INCREF, unless you know no other thread yet knows about it.
Right, that's why I proposed using non-Python types as buffers whilst in the background IO threads. Once the thread finishes processing the event, it pushes the necessary details onto a global interlocked list. ("Details" being event type and possibly a data buffer if the event was 'data received'.) Then, when aio.events() is called, CPython code (holding the GIL) does an interlocked/atomic flush/pop_all, creates the relevant Python objects for the events, and returns them in a list for the calling code to iterate over. The rationale for all of this is that this approach should scale better when heavily loaded (i.e. tens of thousands of connections and/or Gb/s traffic). When you're dealing with that sort of load on a many-core machine (let's say 16+ cores), an interlocked list is going to reduce latency versus 16+ threads constantly vying for the GIL. (That's the theory, at least.) Trent.
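A minimal C sketch of that flush/pop_all-and-convert step, reusing the hypothetical AIO_EVENT layout and g_events list from the worker-thread sketch earlier (the (key, data) tuple layout is also invented; error handling omitted):

    #include <Python.h>
    #include <windows.h>
    #include <stdlib.h>

    static PyObject *
    aio_events(PyObject *self, PyObject *noargs)
    {
        PyObject *result = PyList_New(0);
        if (result == NULL)
            return NULL;

        /* Detach every pending event at once; the background threads keep
         * pushing onto the now-empty list. */
        PSLIST_ENTRY entry = InterlockedFlushSList(&g_events);

        while (entry != NULL) {
            AIO_EVENT *ev = (AIO_EVENT *)entry;
            entry = entry->Next;

            /* Copy the C buffer into a GC-tracked object, then free it. */
            PyObject *data = PyBytes_FromStringAndSize(ev->buf, ev->nbytes);
            PyObject *item = Py_BuildValue("(nN)", (Py_ssize_t)ev->key, data);
            PyList_Append(result, item);
            Py_DECREF(item);

            free(ev->buf);
            _aligned_free(ev);
        }
        return result;
    }

Everything up to the flush happens without any lock other than the single interlocked operation; everything after it runs with the GIL held, as usual.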
On Tue, Nov 27, 2012 at 4:15 PM, Trent Nelson <trent@snakebite.org> wrote:
The rationale for all of this is that this approach should scale better when heavily loaded (i.e. tens of thousands of connections and/or Gb/s traffic). When you're dealing with that sort of load on a many-core machine (let's say 16+ cores), an interlocked list is going to reduce latency versus 16+ threads constantly vying for the GIL.
(That's the theory, at least.)
But why would you need 15 cores to shuffle the bytes around when you have only 1 to run the Python code that responds to those bytes? -- --Guido van Rossum (python.org/~guido)
Den 28. nov. 2012 kl. 01:44 skrev Guido van Rossum <guido@python.org>:
But why would you need 15 cores to shuffle the bytes around when you have only 1 to run the Python code that responds to those bytes?
I don't think he needs all 15 cores for that, assuming the Python processing is the more expensive part. But it would avoid having the GIL change hands on each asynch i/o event. It would also keep the thread that runs the interpreter in cache, if it never goes to sleep. That way objects associated with the interpreter would not be swapped in and out, but kept in cache. Sturla
On Tue, Nov 27, 2012 at 04:44:05PM -0800, Guido van Rossum wrote:
On Tue, Nov 27, 2012 at 4:15 PM, Trent Nelson <trent@snakebite.org> wrote:
The rationale for all of this is that this approach should scale better when heavily loaded (i.e. tens of thousands of connections and/or Gb/s traffic). When you're dealing with that sort of load on a many-core machine (let's say 16+ cores), an interlocked list is going to reduce latency versus 16+ threads constantly vying for the GIL.
(That's the theory, at least.)
But why would you need 15 cores to shuffle the bytes around when you have only 1 to run the Python code that responds to those bytes?
There are a few advantages. For one, something like this:

    with aio.open('1GB-file-on-a-fast-SSD.raw', 'r') as f:
        data = f.read()

Or even just:

    with aio.open('/dev/zero', 'rb') as f:
        data = f.read(1024 * 1024 * 1024)

Would basically complete as fast as is physically possible to read the bytes off the device. That's pretty cool. Ditto for write.

Sturla touched on some of the other advantages regarding cache locality, reduced context switching and the absence of any lock contention.

When using the `for event in aio.events()` approach, sure, you've only got one Python thread, but nothing blocks, and you'll be able to churn away on as many events per second as a single core allows.

On more powerful boxes, you'll eventually hit a limit where the single core event loop can't keep up with the data being serviced by 16+ threads. That's where this chunk of my original e-mail becomes relevant:

> So, let's assume that's all implemented and working in 3.4. The
> drawback of this approach is that even though we've allowed for
> some actual threaded concurrency via background IO threads, the
> main Python code that loops over aio.events() is still limited
> to executing on a single core. Albeit, in a very tight loop that
> never blocks and would probably be able to process an insane number
> of events per second when pegging a single core at 100%.
>
> So, that's 3.4. Perhaps in 3.5 we could add automatic support for
> multiprocessing once the number of events per-poll reach a certain
> threshold. The event loop automatically spreads out the processing
> of events via multiprocessing, facilitating multiple core usage both
> via background threads *and* Python code. (And we could probably do
> some optimizations such that the background IO thread always queues
> up events for the same multiprocessing instance -- which would yield
> even more benefits if we had fancy "buffer inheritance" stuff that
> removes the need to continually copy data from the background IO
> buffers to the foreground CPython code.)

So, down the track, we could explore options for future scaling via something like multiprocessing when the number of incoming events exceeds the ability of the single core `for event in aio.events()` event loop.

Trent.
Den 28. nov. 2012 kl. 01:15 skrev Trent Nelson <trent@snakebite.org>:
Right, that's why I proposed using non-Python types as buffers whilst in the background IO threads. Once the thread finishes processing the event, it pushes the necessary details onto a global interlocked list. ("Details" being event type and possibly a data buffer if the event was 'data received'.)
Then, when aio.events() is called, CPython code (holding the GIL) does an interlocked/atomic flush/pop_all, creates the relevant Python objects for the events, and returns them in a list for the calling code to iterate over.
The rationale for all of this is that this approach should scale better when heavily loaded (i.e. tens of thousands of connections and/or Gb/s traffic). When you're dealing with that sort of load on a many-core machine (let's say 16+ cores), an interlocked list is going to reduce latency versus 16+ threads constantly vying for the GIL.
Sure, the GIL is a lot more expensive than a simple mutex (or an interlocked list), so avoiding a GIL-release and reacquire on each io completion event might be an advantage. Sturla
Den 28. nov. 2012 kl. 01:15 skrev Trent Nelson <trent@snakebite.org>:
Right, that's why I proposed using non-Python types as buffers whilst in the background IO threads. Once the thread finishes processing the event, it pushes the necessary details onto a global interlocked list. ("Details" being event type and possibly a data buffer if the event was 'data received'.)
Then, when aio.events() is called, CPython code (holding the GIL) does an interlocked/atomic flush/pop_all, creates the relevant Python objects for the events, and returns them in a list for the calling code to iterate over.
The rationale for all of this is that this approach should scale better when heavily loaded (i.e. tens of thousands of connections and/or Gb/s traffic). When you're dealing with that sort of load on a many-core machine (let's say 16+ cores), an interlocked list is going to reduce latency versus 16+ threads constantly vying for the GIL.
Sorry. I changed my mind. I believe you are right after all :-)

I see two benefits:

1. It avoids contention for the GIL and avoids excessive context shifts in the CPython interpreter.

2. It potentially keeps the thread that runs the CPython interpreter in cache, as it is always active. And thus it also keeps the objects associated with the CPython interpreter in cache.

So yes, it might be better after all :-)

I don't think it would matter much for multicore scalability, as the Python processing is likely the more expensive part.

Sturla
On Tue, Nov 27, 2012 at 05:09:17PM -0800, Sturla Molden wrote:
Den 28. nov. 2012 kl. 01:15 skrev Trent Nelson <trent@snakebite.org>:
Right, that's why I proposed using non-Python types as buffers whilst in the background IO threads. Once the thread finishes processing the event, it pushes the necessary details onto a global interlocked list. ("Details" being event type and possibly a data buffer if the event was 'data received'.)
Then, when aio.events() is called, CPython code (holding the GIL) does an interlocked/atomic flush/pop_all, creates the relevant Python objects for the events, and returns them in a list for the calling code to iterate over.
The rationale for all of this is that this approach should scale better when heavily loaded (i.e. tens of thousands of connections and/or Gb/s traffic). When you're dealing with that sort of load on a many-core machine (let's say 16+ cores), an interlocked list is going to reduce latency versus 16+ threads constantly vying for the GIL.
Sorry. I changed my mind. I believe you are right after all :-)
I see two benefits:
1. It avoids contention for the GIL and avoids excessive context shifts in the CPython interpreter.
2. It potentially keeps the thread that runs the CPython interpreter in cache, as it is always active. And thus it also keeps the objects associated with the CPython interpreter in cache.
So yes, it might be better after all :-)
That's even sweeter to read considering your initial opposition ;-)
I don't think it would matter much for multicore scalability, as the Python processing is likely the more expensive part.
Yeah, there will eventually be a point where the number of incoming events exceeds the number of events that can be processed by the single core event loop. That's where the ideas for distributing event loop processing via multiprocessing come in. That would be way down the track though -- and there is still lots of benefit to having the async background IO stuff and a single thread event loop. Trent.
On Tue, Nov 27, 2012 at 5:09 PM, Sturla Molden <sturla@molden.no> wrote:
Den 28. nov. 2012 kl. 01:15 skrev Trent Nelson <trent@snakebite.org>:
Right, that's why I proposed using non-Python types as buffers whilst in the background IO threads. Once the thread finishes processing the event, it pushes the necessary details onto a global interlocked list. ("Details" being event type and possibly a data buffer if the event was 'data received'.)
Then, when aio.events() is called, CPython code (holding the GIL) does an interlocked/atomic flush/pop_all, creates the relevant Python objects for the events, and returns them in a list for the calling code to iterate over.
The rationale for all of this is that this approach should scale better when heavily loaded (i.e. tens of thousands of connections and/or Gb/s traffic). When you're dealing with that sort of load on a many-core machine (let's say 16+ cores), an interlocked list is going to reduce latency versus 16+ threads constantly vying for the GIL.
Sorry. I changed my mind. I believe you are right after all :-)
It's always great to see people change their mind.
I see two benefits:
I may not understand the proposal any more, but...
1. It avoids contention for the GIL and avoids excessive context shifts in the CPython interpreter.
Then why not just have one thread?
2. It potentially keeps the thread that runs the CPython interpreter in cache, as it is always active. And thus it also keeps the objects associated with the CPython interpreter in cache.
So what code runs in the other threads? I think I'm confused...
So yes, it might be better after all :-)
I don't think it would matter much for multicore scalability, as the Python processing is likely the more expensive part.
To benefit from multicore, you need to find something that requires a lot of CPU time and can be done without holding on to the GIL. If it's not copying bytes, what is it? -- --Guido van Rossum (python.org/~guido)
Den 28. nov. 2012 kl. 03:59 skrev Guido van Rossum <guido@python.org>:
Then why not just have one thread?
Because of the way IOCPs work on Windows: A pool of threads is waiting on the i/o completion port, one thread from the pool is woken up on i/o completion. There is nothing to do about that, it is an event driven thread pool by design.

The question is what a thread woken up on io completion should do. If it uses the simplified GIL API to ensure the GIL, this would mean excessive GIL shifting with 64k i/o tasks on a port: Each time one of the 64k tasks is complete, a thread would ensure the GIL. That is unlikely to be very scalable.

So what Trent suggested is to just have these threads enqueue some data about the completed task and go back to sleep.

That way the main "Python thread" would never lose the GIL to a thread from the IOCP. Instead it would shortly busy-wait while a completed task is inserted into the queue. Thus synchronization by the GIL is replaced by a spinlock protecting a queue (or an interlocked list on recent Windows versions).
2. It potentially keeps the thread that runs the CPython interpreter in cache, as it is always active. And thus it also keeps the objects associated with the CPython interpreter in cache.
So what code runs in the other threads? I think I'm confused...
Almost nothing. They sleep on the IOCP, wake up on i/o completion, put the completed task in a queue, and go back to waiting/sleeping on the port. But they never attempt to acquire the GIL. Sturla
On Wed, Nov 28, 2012 at 04:07:22AM -0800, Sturla Molden wrote:
Den 28. nov. 2012 kl. 03:59 skrev Guido van Rossum <guido@python.org>:
Then why not just have one thread?
Because of the way IOCPs work on Windows: A pool of threads is waiting on the i/o completion port, one thread from the pool is woken up on i/o completion. There is nothing to do about that, it is an event driven thread pool by design.
The question is what a thread woken up on io completion should do. If it uses the simplified GIL API to ensure the GIL, this would mean excessive GIL shifting with 64k i/o tasks on a port: Each time one of the 64k tasks is complete, a thread would ensure the GIL. That is unlikely to be very scalable.
So what Trent suggested is to just have these threads enqueue some data about the completed task and go back to sleep.
That way the main "Python thread" would never lose the GIL to a thread from the IOCP. Instead it would shortly busy-wait while a completed task is inserted into the queue. Thus synchronization by the GIL is replaced by a spinlock protecting a queue (or an interlocked list on recent Windows versions).
2. It potentially keeps the thread that runs the CPython interpreter in cache, as it is always active. And thus it also keeps the objects associated with the CPython interpreter in cache.
So what code runs in the other threads? I think I'm confused...
Almost nothing. They sleep on the IOCP, wake up on i/o completion, put the completed task in a queue, and go back to waiting/sleeping on the port. But they never attempt to acquire the GIL.
Couldn't have said it better myself. Nice to have you on board Sturla ;-) Trent.
On 28/11/2012 12:07pm, Sturla Molden wrote:
That way the main "Python thread" would never lose the GIL to a thread from the IOCP. Instead it would shortly busy-wait while a completed task is inserted into the queue. Thus synchronization by the GIL is replaced by a spinlock protecting a queue (or an interlocked list on recent Windows versions).
How do you know the busy wait will be short?

If you are worried about the cost of acquiring and releasing the GIL more than necessary, then why not just dequeue as many completion packets as possible at once? (On Vista and later GetQueuedCompletionStatusEx() produces an array of completion packets; on WinXP you can get a similar effect by calling GetQueuedCompletionStatus() in a loop.)

I very much doubt that GetQueuedCompletionStatus*() consumes enough CPU time for it to be worth running in a thread pool (particularly if it forces busy waiting on you).

-- Richard
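A sketch of that batching approach (Vista and later; names are illustrative): the blocking wait happens with the GIL released, and the GIL is reacquired once per batch of completions rather than once per completion:

    #include <Python.h>
    #include <windows.h>

    #define BATCH 64

    static void
    drain_port(HANDLE port)
    {
        OVERLAPPED_ENTRY entries[BATCH];
        ULONG n = 0;

        /* Block without the GIL, reacquire it once per batch. */
        Py_BEGIN_ALLOW_THREADS
        GetQueuedCompletionStatusEx(port, entries, BATCH, &n, INFINITE,
                                    FALSE);
        Py_END_ALLOW_THREADS

        for (ULONG i = 0; i < n; i++) {
            /* entries[i].lpCompletionKey, .lpOverlapped and
             * .dwNumberOfBytesTransferred identify the finished operation;
             * dispatch to Python here with the GIL held. */
        }
    }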
On 28.11.2012 14:00, Richard Oudkerk wrote:
If you are worried about the cost of acquiring and releasing the GIL more than necessary, then why not just dequeue as many completion packets as possible at once. (On Vista and later GetQueuedCompletionStatusEx() produces an array of completion packets; on WinXP you can get a similar effect by calling GetQueuedCompletionStatus() in a loop.)
Hmm, yes, perhaps. One could call GetQueuedCompletionStatusEx in a loop and set dwMilliseconds to 0 (immediate timeout), possibly with a Sleep(0) if the task queue was empty. (Sleep(0) releases the remainder of the time-slice, and so prevents spinning on GetQueuedCompletionStatusEx from burning the CPU.) If after a while (say 10 ms) there are still no tasks in the queue, we release the GIL and call GetQueuedCompletionStatusEx with a longer time-out than 0. Sturla
OK, now I see. (I thought that was how everyone was using IOCP. Apparently not?) However, the "short busy wait" worries me. What if your app *doesn't* get a lot of requests? Isn't the alternative to have a "thread pool" with just one thread, which runs all the Python code and gets woken up by IOCP when it is idle and there is a new event? How is Trent's proposal an improvement? On Wed, Nov 28, 2012 at 4:07 AM, Sturla Molden <sturla@molden.no> wrote:
Den 28. nov. 2012 kl. 03:59 skrev Guido van Rossum <guido@python.org>:
Then why not just have one thread?
Because of the way IOCPs work on Windows: A pool of threads is waiting on the i/o completion port, one thread from the pool is woken up on i/o completion. There is nothing to do about that, it is an event driven thread pool by design.
The question is what a thread woken up on io completion should do. If it uses the simplified GIL API to ensure the GIL, this would mean excessive GIL shifting with 64k i/o tasks on a port: Each time one of the 64k tasks is complete, a thread would ensure the GIL. That is unlikely to be very scalable.
So what Trent suggested is to just have these threads enqueue some data about the completed task and go back to sleep.
That way the main "Python thread" would never lose the GIL to a thread from the IOCP. Instead it would shortly busy-wait while a completed task is inserted into the queue. Thus synchronization by the GIL is replaced by a spinlock protecting a queue (or an interlocked list on recent Windows versions).
2. It potentially keeps the thread that runs the CPython interpreter in cache, as it is always active. And thus it also keeps the objects associated with the CPython interpreter in cache.
So what code runs in the other threads? I think I'm confused...
Almost nothing. They sleep on the IOCP, wake up on i/o completion, put the completed task in a queue, and go back to waiting/sleeping on the port. But they never attempt to acquire the GIL.
Sturla
-- --Guido van Rossum (python.org/~guido)
On 28.11.2012 16:59, Guido van Rossum wrote:
OK, now I see. (I thought that was how everyone was using IOCP. Apparently not?) However, the "short busy wait" worries me. What if your app *doesn't* get a lot of requests?
Then there would be no busy wait.
Isn't the alternative to have a "thread pool" with just one thread, which runs all the Python code and gets woken up by IOCP when it is idle and there is a new event? How is Trent's proposal an improvement?
I'm not sure. This is what I suggested in my previous mail. Sturla
On Wed, Nov 28, 2012 at 07:59:04AM -0800, Guido van Rossum wrote:
OK, now I see. (I thought that was how everyone was using IOCP. Apparently not?) However, the "short busy wait" worries me. What if your app *doesn't* get a lot of requests?
From my response to Richard's concern re: busy waits:

Oooer, that's definitely not what I had in mind. This is how I envisioned it working (think of events() as similar to poll()):

    with aio.events() as events:
        for event in events:
            # process event
            ...

That aio.events() call would result in an InterlockedSListFlush, returning the entire list of available events. It then does the conversion into a CPython event type, bundles everything into a list, then returns.

(In reality, there'd be a bit more glue to handle an empty list a bit more gracefully, and probably a timeout to aio.events(). Nothing should involve a spinlock though.)
Isn't the alternative to have a "thread pool" with just one thread, which runs all the Python code and gets woken up by IOCP when it is idle and there is a new event? How is Trent's proposal an improvement?
I don't really understand this suggestion :/ It's sort of in line with how IOCP is used currently, i.e. "let me tell you when I'm ready to process events", which I'm advocating against with this idea. Trent.
On Wed, Nov 28, 2012 at 12:05 PM, Trent Nelson <trent@snakebite.org> wrote:
On Wed, Nov 28, 2012 at 07:59:04AM -0800, Guido van Rossum wrote:
OK, now I see. (I thought that was how everyone was using IOCP. Apparently not?) However, the "short busy wait" worries me. What if your app *doesn't* get a lot of requests?
From my response to Richard's concern re: busy waits:
Oooer, that's definitely not what I had in mind. This is how I envisioned it working (think of events() as similar to poll()):
    with aio.events() as events:
        for event in events:
            # process event
            ...
That aio.events() call would result in an InterlockedSListFlush, returning the entire list of available events. It then does the conversion into a CPython event type, bundles everything into a list, then returns.
(In reality, there'd be a bit more glue to handle an empty list a bit more gracefully, and probably a timeout to aio.events(). Nothing should involve a spinlock though.)
Isn't the alternative to have a "thread pool" with just one thread, which runs all the Python code and gets woken up by IOCP when it is idle and there is a new event? How is Trent's proposal an improvement?
I don't really understand this suggestion :/ It's sort of in line with how IOCP is used currently, i.e. "let me tell you when I'm ready to process events", which I'm advocating against with this idea.
Well, but since the proposal also seems to be to keep all Python code in one thread, that thread still has to say when it's ready to process events. So, again, what's the big deal? Maybe we just need benchmarks showing events processed per second for various configurations... -- --Guido van Rossum (python.org/~guido)
On Wed, Nov 28, 2012 at 12:15:22PM -0800, Guido van Rossum wrote:
On Wed, Nov 28, 2012 at 12:05 PM, Trent Nelson <trent@snakebite.org> wrote:
On Wed, Nov 28, 2012 at 07:59:04AM -0800, Guido van Rossum wrote:
OK, now I see. (I thought that was how everyone was using IOCP. Apparently not?) However, the "short busy wait" worries me. What if your app *doesn't* get a lot of requests?
From my response to Richard's concern re: busy waits:
Oooer, that's definitely not what I had in mind. This is how I envisioned it working (think of events() as similar to poll()):
    with aio.events() as events:
        for event in events:
            # process event
            ...
That aio.events() call would result in an InterlockedSListFlush, returning the entire list of available events. It then does the conversion into a CPython event type, bundles everything into a list, then returns.
(In reality, there'd be a bit more glue to handle an empty list a bit more gracefully, and probably a timeout to aio.events(). Nothing should involve a spinlock though.)
Isn't the alternative to have a "thread pool" with just one thread, which runs all the Python code and gets woken up by IOCP when it is idle and there is a new event? How is Trent's proposal an improvement?
I don't really understand this suggestion :/ It's sort of in line with how IOCP is used currently, i.e. "let me tell you when I'm ready to process events", which I'm advocating against with this idea.
Well, but since the proposal also seems to be to keep all Python code in one thread, that thread still has to say when it's ready to process events.
Right, so, I'm arguing that with my approach, because the background IO thread stuff is as optimal as it can be -- more IO events would be available per event loop iteration, and the latency between the event occurring versus when the event loop picks it up would be reduced. The theory being that that will result in higher throughput and lower latency in practice.

Also, from a previous e-mail, this:

    with aio.open('1GB-file-on-a-fast-SSD.raw', 'rb') as f:
        data = f.read()

Or even just:

    with aio.open('/dev/zero', 'rb') as f:
        data = f.read(1024 * 1024 * 1024)

Would basically complete as fast as is physically possible to read the bytes off the device. If you've got 16+ cores, then you'll have 16 cores able to service IO interrupts in parallel. So, the overall time to suck in a chunk of data will be vastly reduced.

There's no other way to get this sort of performance without taking my approach.
So, again, what's the big deal? Maybe we just need benchmarks showing events processed per second for various configurations...
Definitely agree with the need for benchmarks. (I'm going to set up an 8-core Snakebite box w/ Windows 2012 server specifically for this purpose, I think.) Trent.
On Wed, Nov 28, 2012 at 12:32 PM, Trent Nelson <trent@snakebite.org> wrote:
Right, so, I'm arguing that with my approach, because the background IO thread stuff is as optimal as it can be -- more IO events would be available per event loop iteration, and the latency between the event occurring versus when the event loop picks it up would be reduced. The theory being that that will result in higher throughput and lower latency in practice.
Also, from a previous e-mail, this:
with aio.open('1GB-file-on-a-fast-SSD.raw', 'rb') as f: data = f.read()
Or even just:
with aio.open('/dev/zero', 'rb') as f: data = f.read(1024 * 1024 * 1024)
Would basically complete as fast as it physically possible to read the bytes off the device. If you've got 16+ cores, then you'll have 16 cores able to service IO interrupts in parallel. So, the overall time to suck in a chunk of data will be vastly reduced.
There's no other way to get this sort of performance without taking my approach.
So there's something I fundamentally don't understand. Why do those calls, made synchronously in today's CPython, not already run as fast as you can get the bytes off the device? I assume it's just a transfer from kernel memory to user memory. So what is the advantage of using aio over with open(<file>, 'rb') as f: data = f.read() ? Is it just that you can run other Python code *while* the I/O is happening as fast as possible in a separate thread? But that would also be the case when using (real) threads in today's CPython. What am I missing? -- --Guido van Rossum (python.org/~guido)
On Wed, Nov 28, 2012 at 12:49:51PM -0800, Guido van Rossum wrote:
On Wed, Nov 28, 2012 at 12:32 PM, Trent Nelson <trent@snakebite.org> wrote:
Right, so, I'm arguing that with my approach, because the background IO thread stuff is as optimal as it can be -- more IO events would be available per event loop iteration, and the latency between the event occurring versus when the event loop picks it up would be reduced. The theory being that that will result in higher throughput and lower latency in practice.
Also, from a previous e-mail, this:
with aio.open('1GB-file-on-a-fast-SSD.raw', 'rb') as f: data = f.read()
Or even just:
with aio.open('/dev/zero', 'rb') as f: data = f.read(1024 * 1024 * 1024)
Would basically complete as fast as it physically possible to read the bytes off the device. If you've got 16+ cores, then you'll have 16 cores able to service IO interrupts in parallel. So, the overall time to suck in a chunk of data will be vastly reduced.
There's no other way to get this sort of performance without taking my approach.
So there's something I fundamentally don't understand. Why do those calls, made synchronously in today's CPython, not already run as fast as you can get the bytes off the device? I assume it's just a transfer from kernel memory to user memory. So what is the advantage of using aio over
with open(<file>, 'rb') as f: data = f.read()
Ah, right. That's where the OVERLAPPED aspect comes into play. (Other than Windows and AIX, I don't think any other OS provides an overlapped IO facility?) The difference being, instead of having one thread writing to a 1GB buffer, 4KB at a time, you have 16 threads writing to an overlapped 1GB buffer, 4KB at a time. (Assuming you have 16+ cores, and IO interrupts are coming in whilst existing threads are still servicing previous completions.) Trent.
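A sketch of what that looks like in practice -- many overlapped reads outstanding at once, each landing in its own slice of a single large buffer (illustrative only; completion handling and error checking are omitted):

    #include <windows.h>
    #include <stdlib.h>

    #define CHUNK (1024 * 1024)         /* say, 1MB per outstanding read */

    static void
    read_whole_file_overlapped(const wchar_t *path, char *buf, ULONGLONG size)
    {
        HANDLE f = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);

        ULONGLONG nchunks = (size + CHUNK - 1) / CHUNK;
        OVERLAPPED *ols = calloc((size_t)nchunks, sizeof(OVERLAPPED));

        for (ULONGLONG i = 0; i < nchunks; i++) {
            ULONGLONG offset = i * CHUNK;
            ols[i].Offset     = (DWORD)(offset & 0xFFFFFFFF);
            ols[i].OffsetHigh = (DWORD)(offset >> 32);

            /* Each call returns immediately with ERROR_IO_PENDING; the
             * kernel fills the slices as completions arrive on whatever
             * completion mechanism the handle is associated with. */
            ReadFile(f, buf + offset, CHUNK, NULL, &ols[i]);
        }
    }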
On Wed, Nov 28, 2012 at 1:02 PM, Trent Nelson <trent@snakebite.org> wrote:
On Wed, Nov 28, 2012 at 12:49:51PM -0800, Guido van Rossum wrote:
On Wed, Nov 28, 2012 at 12:32 PM, Trent Nelson <trent@snakebite.org> wrote:
Right, so, I'm arguing that with my approach, because the background IO thread stuff is as optimal as it can be -- more IO events would be available per event loop iteration, and the latency between the event occurring versus when the event loop picks it up would be reduced. The theory being that that will result in higher throughput and lower latency in practice.
Also, from a previous e-mail, this:
with aio.open('1GB-file-on-a-fast-SSD.raw', 'rb') as f: data = f.read()
Or even just:
with aio.open('/dev/zero', 'rb') as f: data = f.read(1024 * 1024 * 1024)
Would basically complete as fast as it physically possible to read the bytes off the device. If you've got 16+ cores, then you'll have 16 cores able to service IO interrupts in parallel. So, the overall time to suck in a chunk of data will be vastly reduced.
There's no other way to get this sort of performance without taking my approach.
So there's something I fundamentally don't understand. Why do those calls, made synchronously in today's CPython, not already run as fast as you can get the bytes off the device? I assume it's just a transfer from kernel memory to user memory. So what is the advantage of using aio over
with open(<file>, 'rb') as f: data = f.read()
Ah, right. That's where the OVERLAPPED aspect comes into play. (Other than Windows and AIX, I don't think any other OS provides an overlapped IO facility?)
The difference being, instead of having one thread writing to a 1GB buffer, 4KB at a time, you have 16 threads writing to an overlapped 1GB buffer, 4KB at a time.
(Assuming you have 16+ cores, and IO interrupts are coming in whilst existing threads are still servicing previous completions.)
Trent.
Aha. So these are kernel threads? Is the bandwidth of the I/O channel really higher than one CPU can copy bytes across a user/kernel boundary? -- --Guido van Rossum (python.org/~guido)
On Wed, Nov 28, 2012 at 01:18:48PM -0800, Guido van Rossum wrote:
On Wed, Nov 28, 2012 at 1:02 PM, Trent Nelson <trent@snakebite.org> wrote:
On Wed, Nov 28, 2012 at 12:49:51PM -0800, Guido van Rossum wrote:
On Wed, Nov 28, 2012 at 12:32 PM, Trent Nelson <trent@snakebite.org> wrote:
Right, so, I'm arguing that with my approach, because the background IO thread stuff is as optimal as it can be -- more IO events would be available per event loop iteration, and the latency between the event occurring versus when the event loop picks it up would be reduced. The theory being that that will result in higher throughput and lower latency in practice.
Also, from a previous e-mail, this:
with aio.open('1GB-file-on-a-fast-SSD.raw', 'rb') as f: data = f.read()
Or even just:
with aio.open('/dev/zero', 'rb') as f: data = f.read(1024 * 1024 * 1024)
Would basically complete as fast as it physically possible to read the bytes off the device. If you've got 16+ cores, then you'll have 16 cores able to service IO interrupts in parallel. So, the overall time to suck in a chunk of data will be vastly reduced.
There's no other way to get this sort of performance without taking my approach.
So there's something I fundamentally don't understand. Why do those calls, made synchronously in today's CPython, not already run as fast as you can get the bytes off the device? I assume it's just a transfer from kernel memory to user memory. So what is the advantage of using aio over
with open(<file>, 'rb') as f: data = f.read()
Ah, right. That's where the OVERLAPPED aspect comes into play. (Other than Windows and AIX, I don't think any other OS provides an overlapped IO facility?)
The difference being, instead of having one thread writing to a 1GB buffer, 4KB at a time, you have 16 threads writing to an overlapped 1GB buffer, 4KB at a time.
(Assuming you have 16+ cores, and IO interrupts are coming in whilst existing threads are still servicing previous completions.) Trent.
Aha. So these are kernel threads?
Sort-of-but-not-really. In Vista onwards, you don't even work with threads directly, you just provide a callback, and Windows does all sorts of thread pool magic behind the scenes to allow overlapped IO.
Is the bandwidth of the I/O channel really higher than one CPU can copy bytes across a user/kernel boundary?
Ah, good question! Sometimes yes, sometimes no. Depends on the hardware. If you're reading from a single IO source like a file on a disk, it would have to be one hell of a fast disk and one super slow CPU before that would happen. However, consider this:

    aio.readfile('1GB-raw.1', buf1)
    aio.readfile('1GB-raw.2', buf2)
    aio.readfile('1GB-raw.3', buf3)
    ...

    with aio.events() as events:
        for event in events:
            if event.type == EventType.FileReadComplete:
                aio.writefile(event.fname + '.bak', event.buf)
            if event.type == EventType.FileWriteComplete:
                log.debug('backed up ' + event.fname)
            if event.type == EventType.FileWriteFailed:
                log.error('failed backed up ' + event.fname)

aio.readfile() and writefile() return instantly. With sufficient files being handled in parallel, the ability to have 16+ threads handle incoming requests instantly would be very useful.

Second beneficial example would be if you're a socket server with 65k active connections. New interrupts will continually be pouring in whilst you're still in the middle of copying data from a previous interrupt. Using my approach, Windows would be free to use as many threads as you have cores to service all these incoming requests concurrently.

Because the threads are so simple and don't touch any CPython stuff, their cache footprint will be very small, which is ideal. All they are doing is copying bytes then a quick interlocked list push, so they'll run extremely quickly, often within their first quantum, which means they're ready to service another request that much quicker.

An important detail probably worth noting at this point: Windows won't spawn more threads than there are cores*. So, if you've got all 16 threads tied up contending for the GIL and messing around with PyList_Append() etc, you're going to kill your performance; it'll take a lot longer to process new requests because the threads take so much longer to do their work.

And compare that with the ultimate performance killer of a single thread that periodically calls GetQueuedCompletionStatus when it's ready to process some IO, and you can see how strange it would seem to take that approach. You're getting all the complexity of IOCP and overlapped IO with absolutely none of the benefits.

Trent.
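For reference, the Vista+ "just provide a callback" style mentioned above might look roughly like this (names and error handling are illustrative):

    #include <windows.h>

    static VOID CALLBACK
    io_done(PTP_CALLBACK_INSTANCE inst, PVOID ctx, PVOID overlapped,
            ULONG result, ULONG_PTR nbytes, PTP_IO io)
    {
        /* Runs on one of the system's pool threads, never holding the GIL.
         * This is where the "push onto an interlocked list" step would go. */
    }

    static PTP_IO
    register_handle(HANDLE h)
    {
        PTP_IO io = CreateThreadpoolIo(h, io_done, NULL, NULL);

        /* StartThreadpoolIo() must be called before *each* overlapped
         * operation issued on 'h' (and CancelThreadpoolIo() if that
         * operation fails synchronously). */
        StartThreadpoolIo(io);
        return io;
    }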
On 28/11/2012 10:40pm, Trent Nelson wrote:
And compare that with the ultimate performance killer of a single thread that periodically calls GetQueuedCompletionStatus when it's ready to process some IO, and you can see how strange it would seem to take that approach. You're getting all the complexity of IOCP and overlapped IO with absolutely none of the benefits.
BTW, GetQueuedCompletionStatusEx() lets you dequeue an array of messages instead of working one at a time. -- Richard
On Wed, Nov 28, 2012 at 02:52:49PM -0800, Richard Oudkerk wrote:
On 28/11/2012 10:40pm, Trent Nelson wrote:
And compare that with the ultimate performance killer of a single thread that periodically calls GetQueuedCompletionStatus when it's ready to process some IO, and you can see how strange it would seem to take that approach. You're getting all the complexity of IOCP and overlapped IO with absolutely none of the benefits.
BTW, GetQueuedCompletionStatusEx() lets you dequeue an array of messages instead of working one at a time.
Right. The funny thing about that is it's only available in Vista onwards. And if you're on Vista onwards, the new thread pool APIs are available, which negate the need to call GetQueuedCompletionStatus* at all. Trent.
On 28/11/2012 11:04pm, Trent Nelson wrote:
Right. The funny thing about that is it's only available in Vista onwards. And if you're on Vista onwards, the new thread pool APIs are available, which negate the need to call GetQueuedCompletionStatus* at all.
Trent.
Completion ports are just thread safe queues equivalent to queue.Queue in Python: PostQueuedCompletionStatus() corresponds to Queue.put() and GetQueuedCompletionStatus() corresponds to Queue.get(). They are not specific to IO and can be used for general message passing between threads.

I suspect that registering a file handle with a completion port is simply implemented (at least on XP) by using BindIoCompletionCallback() to register a callback which calls PostQueuedCompletionStatus() whenever an operation completes on that handle.

You seem to be proposing to do *exactly the same thing*, but using a different queue implementation (and using a Vista-only equivalent of BindIoCompletionCallback()).

I also suspect that if you try to implement a thread safe queue which does not busy wait, then you will end up having to add synchronization which makes using an interlocked list unnecessary and slower than using a normal list. Will it really be superior to the operating system's builtin queue implementation?

-- Richard
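(To make the queue analogy concrete: here is a minimal, purely illustrative sketch using queue.Queue for the same kind of message passing between threads, with put() and get() standing in for PostQueuedCompletionStatus() and GetQueuedCompletionStatus(). None of the names below come from the thread; it is just the analogy spelled out.)

    import queue
    import threading

    completion_port = queue.Queue()          # plays the role of the IOCP handle

    def post_completion(nbytes, key, overlapped):
        # analogue of PostQueuedCompletionStatus()
        completion_port.put((nbytes, key, overlapped))

    def worker():
        while True:
            # analogue of GetQueuedCompletionStatus(): blocks until a
            # "completion packet" is available
            nbytes, key, overlapped = completion_port.get()
            if key is None:                  # sentinel used to shut the worker down
                break
            print('completed %d bytes for key %r' % (nbytes, key))

    t = threading.Thread(target=worker)
    t.start()
    post_completion(4096, 'conn-1', None)
    post_completion(0, None, None)
    t.join()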
Well, okay, please go benchmark something and don't let my ignorance of async I/O on Windows discourage you. (I suppose you've actually written code like this in C or C++ so you know it all works?)

It still looks to me like you'll have a hard time keeping 16 cores busy if the premise is that you're doing *some* processing in Python (as opposed to the rather unlikely use case of backing up 1GB files), but it also looks to me that, if your approach works, it could be sliced into (e.g.) a Twisted reactor easily without changing Twisted's high-level interfaces in any way.

Do you have an implementation for the "interlocked list" that you mention?

On Wed, Nov 28, 2012 at 2:40 PM, Trent Nelson <trent@snakebite.org> wrote:
On Wed, Nov 28, 2012 at 01:18:48PM -0800, Guido van Rossum wrote:
On Wed, Nov 28, 2012 at 1:02 PM, Trent Nelson <trent@snakebite.org> wrote:
On Wed, Nov 28, 2012 at 12:49:51PM -0800, Guido van Rossum wrote:
On Wed, Nov 28, 2012 at 12:32 PM, Trent Nelson <trent@snakebite.org> wrote:
Right, so, I'm arguing that with my approach, because the background IO thread stuff is as optimal as it can be -- more IO events would be available per event loop iteration, and the latency between the event occurring versus when the event loop picks it up would be reduced. The theory being that that will result in higher throughput and lower latency in practice.
Also, from a previous e-mail, this:
with aio.open('1GB-file-on-a-fast-SSD.raw', 'rb') as f: data = f.read()
Or even just:
with aio.open('/dev/zero', 'rb') as f: data = f.read(1024 * 1024 * 1024)
Would basically complete as fast as is physically possible to read the bytes off the device. If you've got 16+ cores, then you'll have 16 cores able to service IO interrupts in parallel. So, the overall time to suck in a chunk of data will be vastly reduced.
There's no other way to get this sort of performance without taking my approach.
So there's something I fundamentally don't understand. Why do those calls, made synchronously in today's CPython, not already run as fast as you can get the bytes off the device? I assume it's just a transfer from kernel memory to user memory. So what is the advantage of using aio over
with open(<file>, 'rb') as f: data = f.read()
Ah, right. That's where the OVERLAPPED aspect comes into play. (Other than Windows and AIX, I don't think any other OS provides an overlapped IO facility?)
The difference being, instead of having one thread writing to a 1GB buffer, 4KB at a time, you have 16 threads writing to an overlapped 1GB buffer, 4KB at a time.
(Assuming you have 16+ cores, and IO interrupts are coming in whilst existing threads are still servicing previous completions.) Trent.
Aha. So these are kernel threads?
Sort-of-but-not-really. In Vista onwards, you don't even work with threads directly, you just provide a callback, and Windows does all sorts of thread pool magic behind the scenes to allow overlapped IO.
Is the bandwidth of the I/O channel really higher than one CPU can copy bytes across a user/kernel boundary?
Ah, good question! Sometimes yes, sometimes no. Depends on the hardware. If you're reading from a single IO source like a file on a disk, it would have to be one hell of a fast disk and one super slow CPU before that would happen.
However, consider this:
    aio.readfile('1GB-raw.1', buf1)
    aio.readfile('1GB-raw.2', buf2)
    aio.readfile('1GB-raw.3', buf3)
    ...
    with aio.events() as events:
        for event in events:
            if event.type == EventType.FileReadComplete:
                aio.writefile(event.fname + '.bak', event.buf)
            if event.type == EventType.FileWriteComplete:
                log.debug('backed up ' + event.fname)
            if event.type == EventType.FileWriteFailed:
                log.error('failed backed up ' + event.fname)
aio.readfile() and writefile() return instantly. With sufficient files being handled in parallel, the ability to have 16+ threads handle incoming requests instantly would be very useful.
Second beneficial example would be if you're a socket server with 65k active connections. New interrupts will continually be pouring in whilst you're still in the middle of copying data from a previous interrupt.
Using my approach, Windows would be free to use as many threads as you have cores to service all these incoming requests concurrently.
Because the threads are so simple and don't touch any CPython stuff, their cache footprint will be very small, which is ideal. All they are doing is copying bytes then a quick interlocked list push, so they'll run extremely quickly, often within their first quantum, which means they're ready to service another request that much quicker.
An important detail probably worth noting at this point: Windows won't spawn more threads than there are cores*. So, if you've got all 16 threads tied up contending for the GIL and messing around with PyList_Append() etc, you're going to kill your performance; it'll take a lot longer to process new requests because the threads take so much longer to do their work.
And compare that with the ultimate performance killer of a single thread that periodically calls GetQueuedCompletionStatus when it's ready to process some IO, and you can see how strange it would seem to take that approach. You're getting all the complexity of IOCP and overlapped IO with absolutely none of the benefits.
Trent.
-- --Guido van Rossum (python.org/~guido)
On Wed, Nov 28, 2012 at 02:52:56PM -0800, Guido van Rossum wrote:
Well, okay, please go benchmark something and don't let my ignorance of async I/O on Windows discourage you.
Great!
(I suppose you've actually written code like this in C or C++ so you know it all works?)
Not recently :-) (I have some helpers though: http://trent.snakebite.net/books.jpg)
It still looks to me like you'll have a hard time keeping 16 cores busy if the premise is that you're doing *some* processing in Python (as opposed to the rather unlikely use case of backing up 1GB files),
Yeah it won't take much for Python to become the single-core bottleneck once this is in place. But we can explore ways to address this down the track. (I still think multiprocessing might be a viable approach for map/reducing all the events over multiple cores.)
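(A rough sketch of what that multiprocessing idea might look like. handle_event and the event strings are hypothetical stand-ins, and whether shipping event data to worker processes pays for itself is exactly the kind of thing that would need benchmarking first.)

    import multiprocessing

    def handle_event(event):
        # hypothetical CPU-bound per-event work (parsing, protocol handling, ...)
        return event.upper()

    if __name__ == '__main__':
        # stand-in for a batch of events pulled from aio.events()
        events = ['event-%d' % i for i in range(10000)]
        with multiprocessing.Pool() as pool:
            results = pool.map(handle_event, events)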
but it also looks to me that, if your approach works, it could be sliced into (e.g.) a Twisted reactor easily without changing Twisted's high-level interfaces in any way.
It's almost ridiculous how well suited this approach would be for Twisted. Which is funny, given their aversion to Windows ;-) I spent a few days reviewing their overall code/approach when I was looking at iocpreactor, and it definitely influenced the approach I'm proposing.

I haven't mentioned all the other async Windows ops (especially all the new ones in 8) that this approach would also support. Juicy stuff like async accept, getaddrinfo, etc. That would slot straight in just as nicely:

    if event.type == EventType.GetAddrInfoComplete:
        ...
    elif event.type == EventType.AcceptComplete:
        ...
Do you have an implementation for the "interlocked list" that you mention?
Do MSDN docs count? ;-) http://msdn.microsoft.com/en-us/library/windows/desktop/ms684121(v=vs.85).as... It's a pretty simple API. Shouldn't be too hard to replicate with gcc/clang primitives. Trent.
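(For readers without the MSDN page handy, here is a pure-Python sketch of the push/flush semantics only. It uses a mutex around a plain list, so it has none of the lock-free properties of the real SList API, and the real API returns entries in LIFO order; it is only meant to show the shape of how background threads would push completions and aio.events() would drain them.)

    import threading

    class FlushableList:
        # mimics InterlockedPushEntrySList / InterlockedFlushSList semantics
        # with a lock; the real thing is a lock-free LIFO list
        def __init__(self):
            self._lock = threading.Lock()
            self._entries = []

        def push(self, entry):               # ~ InterlockedPushEntrySList
            with self._lock:
                self._entries.append(entry)

        def flush(self):                     # ~ InterlockedFlushSList
            with self._lock:
                entries, self._entries = self._entries, []
            return entries

    completed = FlushableList()
    # background IO threads would call completed.push(event) as each IO
    # finishes; aio.events() would call completed.flush() and convert each
    # entry into a Python-level event object.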
Trent Nelson wrote:
I'm arguing that with my approach, because the background IO thread stuff is as optimal as it can be -- more IO events would be available per event loop iteration, and the latency between the event occurring versus when the event loop picks it up would be reduced. The theory being that that will result in higher throughput and lower latency in practice.
But the data still has to wait around somewhere until the Python thread gets around to dealing with it. I don't see why it's better for it to sit around in the interlocked list than it is for the completion packets to just wait in the IOCP until the Python thread is ready. -- Greg
On Wed, Nov 28, 2012 at 02:28:35PM -0800, Greg Ewing wrote:
Trent Nelson wrote:
I'm arguing that with my approach, because the background IO thread stuff is as optimal as it can be -- more IO events would be available per event loop iteration, and the latency between the event occurring versus when the event loop picks it up would be reduced. The theory being that that will result in higher throughput and lower latency in practice.
But the data still has to wait around somewhere until the Python thread gets around to dealing with it. I don't see why it's better for it to sit around in the interlocked list than it is for the completion packets to just wait in the IOCP until the Python thread is ready.
Hopefully the response I just sent to Guido makes things a little clearer? I gave a few more examples of where I believe my approach is going to be much better than the single thread approach, which overlaps the concerns you raise here. Trent.
Sturla Molden wrote:
Then why not just have one thread?
Because of the way IOCPs work on Windows: A pool of threads is waiting on the i/o completion port, one thread from the pool is woken up on i/o completion.
But does it *have* to be used that way? The OP claimed that Twisted was using a single thread that explicitly polls the IOCP, instead of an OS-managed thread pool. If that's possible, the question is then whether the extra complexity of having I/O threads wake up the Python thread gains you anything, given that the Python thread will be doing the bulk of the work. -- Greg
Trent Nelson wrote:
When you're dealing with that sort of load on a many-core machine (let's say 16+ cores), an interlocked list is going to reduce latency versus 16+ threads constantly vying for the GIL.
I don't understand. Why is vying for access to an interlocked list any less latentful than vying for the GIL? -- Greg
Den 28. nov. 2012 kl. 06:20 skrev Greg Ewing <greg.ewing@canterbury.ac.nz>:
Trent Nelson wrote:
When you're dealing with that sort of load on a many-core machine (let's say 16+ cores), an interlocked list is going to reduce latency versus 16+ threads constantly vying for the GIL.
I don't understand. Why is vying for access to an interlocked list any less latentful than vying for the GIL?
Because acquiring and releasing the GIL is more expensive than an atomic read/write. And because the thread running Python would go to sleep while an IOCP thread is active, and perhaps be moved out of CPU cache. Sturla
On Tue, Nov 27, 2012 at 09:20:30PM -0800, Greg Ewing wrote:
Trent Nelson wrote:
When you're dealing with that sort of load on a many-core machine (let's say 16+ cores), an interlocked list is going to reduce latency versus 16+ threads constantly vying for the GIL.
I don't understand. Why is vying for access to an interlocked list any less latentful than vying for the GIL?
I think not having to contend with the interpreter would make a big difference under load. A push to an interlocked list will be more performant than having all threads attempt to do GIL acquire -> PyList_Append() -> GIL release. The key to getting high performance (either low latency or high throughput) with the background IO stuff is ensuring the threads complete their work as quickly as possible. The quicker they process an event, the quicker they can process another event, the higher the overall throughput and the lower the overall latency. Doing a GIL acquire/PyList_Append()/GIL release at the end of the event would add a huge overhead that simply would not be present with an interlocked push. Also, as soon as you call back into CPython from the background thread, your cache footprint explodes, which isn't desirable. Trent.
On Tue, 27 Nov 2012 19:15:14 -0500 Trent Nelson <trent@snakebite.org> wrote:
On Tue, Nov 27, 2012 at 03:50:34PM -0800, Guido van Rossum wrote:
On Tue, Nov 27, 2012 at 3:33 PM, Sturla Molden <sturla@molden.no> wrote:
Den 27. nov. 2012 kl. 23:36 skrev Trent Nelson <trent@snakebite.org>:
Right, but with things like interlocked lists, you can make that CPython|background_IO synchronization barrier much more performant than relying on GIL acquisition.
You always need the GIL to call back to Python. You don't need it for anything else.
You also need it for any use of an object, even INCREF, unless you know no other thread yet knows about it.
Right, that's why I proposed using non-Python types as buffers whilst in the background IO threads.
Trent, once again, please read about Py_buffer. Thanks Antoine.
On Tue, Nov 27, 2012 at 11:02:58PM -0800, Antoine Pitrou wrote:
On Tue, 27 Nov 2012 19:15:14 -0500 Trent Nelson <trent@snakebite.org> wrote:
On Tue, Nov 27, 2012 at 03:50:34PM -0800, Guido van Rossum wrote:
On Tue, Nov 27, 2012 at 3:33 PM, Sturla Molden <sturla@molden.no> wrote:
Den 27. nov. 2012 kl. 23:36 skrev Trent Nelson <trent@snakebite.org>:
Right, but with things like interlocked lists, you can make that CPython|background_IO synchronization barrier much more performant than relying on GIL acquisition.
You always need the GIL to call back to Python. You don't need it for anything else.
You also need it for any use of an object, even INCREF, unless you know no other thread yet knows about it.
Right, that's why I proposed using non-Python types as buffers whilst in the background IO threads.
Trent, once again, please read about Py_buffer.
Sorry, I did see your previous e-mail, honest. Please interpret that sentence as "that's why I proposed using something that doesn't need to hold the GIL in the background IO threads". Where 'something' sounds like it should be Py_buffer ;-) Trent.
Trent Nelson wrote:
So what about the "no processing that can be usefully done by a C level thread" bit? I'm trying to discern whether or not you're highlighting a fundamental flaw in the theory/idea ;-)
You seem to be assuming that the little bit of processing needed to get the data from kernel to user space is going to be significant compared to whatever the Python code is going to do with the data. That seems unlikely. -- Greg
On Tue, Nov 27, 2012 at 12:54:50PM -0800, Greg Ewing wrote:
Trent Nelson wrote:
So what about the "no processing that can be usefully done by a C level thread" bit? I'm trying to discern whether or not you're highlighting a fundamental flaw in the theory/idea ;-)
You seem to be assuming that the little bit of processing needed to get the data from kernel to user space is going to be significant compared to whatever the Python code is going to do with the data. That seems unlikely.
Great response. Again, highlights the need to have some standard way for benchmarking this sort of stuff. Would be a good use of Snakebite; everything is on a gigabit switch on the same subnet, multiple boxes could simulate client load against each server being benchmarked. (I don't think my idea would really start to show gains until you're dealing with tens of thousands of connections and 1Gb/10Gb Ethernet traffic, to be honest. So, as I mentioned in a previous e-mail, it depends on what our goals are for the next-gen AIO framework (performance at whatever the cost vs not).) Trent.
On 27/11/2012 8:19pm, Trent Nelson wrote:
Got it. So what about the "no processing that can be usefully done by a C level thread" bit? I'm trying to discern whether or not you're highlighting a fundamental flaw in the theory/idea;-)
(That it's going to be more optimal to have background threads service IO without the need to acquire the GIL, basically.)
I mean that I don't understand what sort of "servicing" you expect the background threads to do. If you just mean consuming packets from GetQueuedCompletionStatus() and pushing them on an interlocked stack then why bother? -- Richard
On Tue, Nov 27, 2012 at 01:42:33PM -0800, Richard Oudkerk wrote:
On 27/11/2012 8:19pm, Trent Nelson wrote:
Got it. So what about the "no processing that can be usefully done by a C level thread" bit? I'm trying to discern whether or not you're highlighting a fundamental flaw in the theory/idea;-)
(That it's going to be more optimal to have background threads service IO without the need to acquire the GIL, basically.)
I mean that I don't understand what sort of "servicing" you expect the background threads to do.
If you just mean consuming packets from GetQueuedCompletionStatus() and pushing them on an interlocked stack then why bother?
Theoretically: lower latency, higher throughput and better scalability (additional cores improve both) than alternate approaches when under load.

Let's just say the goal of the new async IO framework is to be able to handle 65k simultaneous connections and/or saturate multiple 10Gb Ethernet links (or 16Gb FC, or 300Gb IB) on a system where a pure C/C++ solution using native libs (kqueue, epoll, IOCP, GCD etc) *could* do that. What async IO library of the future could come the closest? That's sort of the thought process I had, which led to this idea.

We should definitely have a systematic way of benchmarking this sort of stuff though, otherwise it's all conjecture. On that note, I came across a very interesting presentation a few weeks ago whilst doing research: http://www.mailinator.com/tymaPaulMultithreaded.pdf

He makes some very interesting observations regarding contemporary performance of non-blocking versus thousands-of-blocking-threads. It highlights the importance of having a way to systematically test assumptions like "IOCP will handle load better than WSAPoll". Definitely worth the read. The TL;DR version is:

- Thousands of threads doing blocking IO isn't as bad as everyone thinks. It used to suck, but these days, on multicore machines and contemporary kernels, it ain't so bad.

- Throughput is much better using blocking IO than non.

From Python's next-gen AIO perspective, I think it would be useful to define our goals. Is absolute balls-to-the-wall as-close-to-metal-as-possible performance (like 65k clients or 1GB/s saturation) the ultimate goal? If not, then what? Half that, but with scalability? Quarter of that, but with a beautifully elegant/simple API?

Trent.
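(For reference, the "thousands of blocking threads" model from that presentation is about as simple as server code gets. The sketch below is a plain thread-per-connection echo server; it is only meant as the baseline any benchmark of the IOCP/interlocked approach would be compared against.)

    import socket
    import threading

    def handle(conn):
        # plain blocking IO: the thread sleeps in recv() until data arrives
        with conn:
            while True:
                data = conn.recv(4096)
                if not data:
                    break
                conn.sendall(data)

    def serve(host='127.0.0.1', port=8000):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind((host, port))
            srv.listen(128)
            while True:
                conn, _ = srv.accept()
                # one thread per connection; the kernel scheduler, not an
                # event loop, decides who runs next
                t = threading.Thread(target=handle, args=(conn,))
                t.daemon = True
                t.start()

    serve()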
Den 27. nov. 2012 kl. 18:49 skrev Trent Nelson <trent@snakebite.org>:
Here's the "idea" I had, with zero working code to back it up: what if we had a bunch of threads in the background whose sole purpose it was to handle AIO? On Windows/AIX, they would poll GetQueuedCompletionStatus, on Solaris, get_event().
They're literally raw pthreads and have absolutely nothing to do with Python's threading.Thread() stuff. They exist solely in C and can't be interfaced to directly from Python code.
....which means they're free to run outside the GIL, and thus, multiple cores could be leveraged concurrently. (Only for processing completed I/O, but hey, it's better than nothing.)
And herein lies the misunderstanding.

A Python thread can do the same processing of completed I/O before it reacquires the GIL – and thus Python can run on multiple cores concurrently. There is no difference between a pthread and a threading.Thread that has released the GIL. You don't need to spawn a pthread to process data independently of the GIL. You just need to process the data before the GIL is reacquired.

In fact, I use Python threads for parallel computing all the time. They scale as well as OpenMP threads on multiple cores. Why? Because I have made sure the computational kernels (e.g. LAPACK functions) release the GIL before they execute – and the GIL is not reacquired before they are done. As long as the threads are running in C or Fortran land they don't need the GIL. I don't need to spawn pthreads or use OpenMP pragmas to create threads that can run freely on all cores. Python threads (threading.Thread) can do that too.

Sturla
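(Sturla's point is easy to demonstrate with the stdlib alone: hashlib is documented to release the GIL while hashing buffers larger than 2047 bytes, so a crude sketch like the following should show the threaded runs overlapping on separate cores rather than serialising. Exact timings will obviously vary.)

    import hashlib
    import threading
    import time

    data = b'x' * (64 * 1024 * 1024)         # 64 MB per thread

    def work():
        h = hashlib.sha256()
        # hashlib drops the GIL for updates on large buffers, so these
        # threads can genuinely run on separate cores
        h.update(data)
        h.hexdigest()

    threads = [threading.Thread(target=work) for _ in range(4)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print('4 threads: %.2fs' % (time.time() - start))

    start = time.time()
    for _ in range(4):
        work()
    print('serial:    %.2fs' % (time.time() - start))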
On Tue, Nov 27, 2012 at 03:50:18PM -0800, Sturla Molden wrote:
Den 27. nov. 2012 kl. 18:49 skrev Trent Nelson <trent@snakebite.org>:
Here's the "idea" I had, with zero working code to back it up: what if we had a bunch of threads in the background whose sole purpose it was to handle AIO? On Windows/AIX, they would poll GetQueuedCompletionStatus, on Solaris, get_event().
They're literally raw pthreads and have absolutely nothing to do with Python's threading.Thread() stuff. They exist solely in C and can't be interfaced to directly from Python code.
....which means they're free to run outside the GIL, and thus, multiple cores could be leveraged concurrently. (Only for processing completed I/O, but hey, it's better than nothing.)
And herein lies the misunderstanding.
A Python thread can do the same processing of completed I/O before it reacquires the GIL – and thus Python can run on multiple cores concurrently. There is no difference between a pthread and a threading.Thread that has released the GIL. You don't need to spawn a pthread to process data independently of the GIL. You just need to process the data before the GIL is reacquired.
In fact, I use Python threads for parallel computing all the time. They scale as well as OpenMP threads on multiple cores. Why? Because I have made sure the computational kernels (e.g. LAPACK functions) release the GIL before they execute – and the GIL is not reacquired before they are done. As long as the threads are running in C or Fortran land they don't need the GIL. I don't need to spawn pthreads or use OpenMP pragmas to create threads that can run freely on all cores. Python threads (threading.Thread) can do that too.
Perhaps I was a little too eager to highlight the ability for these background IO threads to run without needing to acquire the GIL. A Python thread could indeed do the same job, however, you still wouldn't interact with it from Python code as if it were a normal threading.Thread. Trent.
Den 28. nov. 2012 kl. 01:01 skrev Trent Nelson <trent@snakebite.org>:
Perhaps I was a little too eager to highlight the ability for these background IO threads to run without needing to acquire the GIL. A Python thread could indeed do the same job, however, you still wouldn't interact with it from Python code as if it were a normal threading.Thread.
threading.Thread allows us to spawn a thread in a platform-independent way. Here is a thread pool:

    pool = [Thread(target=foobar) for i in range(n)]

These threads can release the GIL and call GetQueuedCompletionStatus. They can do any post-processing they want without the GIL. And they can return back to Python while holding the GIL.

Using a pool of non-Python threads in between would also take some of the scalability of IOCPs away. The thread that was the last to run is the first to be woken up on IO completion. That way the kernel wakes up a thread that is likely still in cache. But if you prevent the IOCP from waking up the thread that will call back to Python, then this scalability trick is of no value.

Sturla
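(A rough, Windows-only sketch of that pool using ctypes, which releases the GIL for the duration of each foreign call, so plain threading.Thread workers really do block in GetQueuedCompletionStatus without holding the GIL. Error handling and the actual association of file handles with the port are omitted; this only shows the shape of the idea.)

    import ctypes
    import threading
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.CreateIoCompletionPort.restype = wintypes.HANDLE
    kernel32.CreateIoCompletionPort.argtypes = [
        wintypes.HANDLE, wintypes.HANDLE, ctypes.c_size_t, wintypes.DWORD]
    kernel32.GetQueuedCompletionStatus.restype = wintypes.BOOL
    kernel32.GetQueuedCompletionStatus.argtypes = [
        wintypes.HANDLE, ctypes.POINTER(wintypes.DWORD),
        ctypes.POINTER(ctypes.c_size_t), ctypes.POINTER(ctypes.c_void_p),
        wintypes.DWORD]

    INVALID_HANDLE_VALUE = wintypes.HANDLE(-1)
    INFINITE = 0xFFFFFFFF

    # a fresh completion port, not yet associated with any file handle
    port = kernel32.CreateIoCompletionPort(INVALID_HANDLE_VALUE, None, 0, 0)

    def worker():
        nbytes = wintypes.DWORD()
        key = ctypes.c_size_t()
        overlapped = ctypes.c_void_p()
        while True:
            # the GIL is dropped while this call blocks, so every thread in
            # the pool waits on the port concurrently; it is reacquired when
            # the call returns and we are back in Python code
            kernel32.GetQueuedCompletionStatus(
                port, ctypes.byref(nbytes), ctypes.byref(key),
                ctypes.byref(overlapped), INFINITE)
            # ...hand the completion packet over to the event loop here...

    pool = [threading.Thread(target=worker) for i in range(4)]
    for t in pool:
        t.daemon = True
        t.start()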
On Tue, Nov 27, 2012 at 04:25:10PM -0800, Sturla Molden wrote:
Den 28. nov. 2012 kl. 01:01 skrev Trent Nelson <trent@snakebite.org>:
Perhaps I was a little too eager to highlight the ability for these background IO threads to run without needing to acquire the GIL. A Python thread could indeed do the same job, however, you still wouldn't interact with it from Python code as if it were a normal threading.Thread.
threading.Thread allows us to spawn a thread in a platform-independent way.
Here is a thread pool: pool = [Thread(target=foobar) for i in range(n)]
These threads can release the GIL and call GetQueuedCompletionStatus. They can do any post-processing they want without the GIL. And they can return back to Python while holding the GIL.
Using a pool of non-Python threads inbetween would also take some of the scalability of IOCPs away. The thread that was the last to run is the first to be woken up on IO completion. That way the kernel wakes up a thread that is likely still in cache. But if you prevent the IOCP from waking up the thread that will call back to Python, then this scalability trick is of no value.
I'm not sure where you're getting the idea that I'll be hampering the Windows IOCP optimizations that wake the last thread. Nothing I've described would have any impact on that. Trent.
participants (6):
- Antoine Pitrou
- Greg Ewing
- Guido van Rossum
- Richard Oudkerk
- Sturla Molden
- Trent Nelson