Re: [Python-ideas] An alternate approach to async IO

On 28.11.2012 16:49, Richard Oudkerk wrote:
You are assuming that GetQueuedCompletionStatus*() will never block because of lack of work.
GetQueuedCompletionStatusEx takes a time-out argument, which can be zero. Sturla
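
For concreteness, here is a minimal, Windows-only ctypes sketch of the kind of zero-timeout poll Sturla is referring to. It is only an illustration: the completion port handle is assumed to come from elsewhere, and none of the names below belong to any existing aio module.

    import ctypes
    from ctypes import wintypes

    # Layout of OVERLAPPED_ENTRY as filled in by GetQueuedCompletionStatusEx.
    class OVERLAPPED_ENTRY(ctypes.Structure):
        _fields_ = [
            ("lpCompletionKey", ctypes.c_void_p),
            ("lpOverlapped", ctypes.c_void_p),
            ("Internal", ctypes.c_void_p),
            ("dwNumberOfBytesTransferred", wintypes.DWORD),
        ]

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)

    def poll_iocp(iocp, max_entries=64, timeout_ms=0):
        """Drain up to max_entries completions; timeout_ms=0 polls without blocking."""
        entries = (OVERLAPPED_ENTRY * max_entries)()
        removed = wintypes.ULONG(0)
        ok = kernel32.GetQueuedCompletionStatusEx(
            iocp, entries, max_entries, ctypes.byref(removed),
            timeout_ms, False)          # not alertable
        if not ok:
            return []                   # time-out (or error): nothing to do
        return entries[:removed.value]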

On 28/11/2012 6:59pm, Sturla Molden wrote:
According to your (or Trent's) idea, the main thread busy-waits until the interlocked list is non-empty. If there is no work to do then the interlocked list is empty, and the main thread will busy-wait until there is work to do, which might be for a long time. -- Richard

On 28.11.2012 20:11, Richard Oudkerk wrote:
That would not be an advantage. Surely it should time out, or at least stop busy-waiting at some point... But I am not sure a list like Trent described is better than just calling GetQueuedCompletionStatusEx from the Python thread. One could busy-wait with a zero timeout for a while, and then at some point switch to timeouts of a few milliseconds (1 or 2, or perhaps 10). An IOCP already sets up a task queue, so I am back to thinking that stacking two task queues one after the other does not help very much. -- Sturla

On Wed, Nov 28, 2012 at 11:11:42AM -0800, Richard Oudkerk wrote:
Oooer, that's definitely not what I had in mind. This is how I envisioned it working (think of events() as similar to poll()):

    with aio.events() as events:
        for event in events:
            # process event
            ...

That aio.events() call would result in an InterlockedSListFlush, returning the entire list of available events. It then does the conversion into a CPython event type, bundles everything into a list, then returns. (In reality, there'd be a bit more glue to handle an empty list a bit more gracefully, and probably a timeout to aio.events(). Nothing should involve a spinlock though.) Trent.
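
For illustration, a toy pure-Python stand-in for the shape described above: a lock-protected list plays the role of the interlocked SList, and events() hands back everything pushed so far in one flush. The real flush would be a single InterlockedSListFlush in C; all names here are invented for the sketch.

    import contextlib
    import threading

    _pending = []                      # stand-in for the interlocked SList
    _pending_lock = threading.Lock()

    def _flush_completions():
        # Atomically detach everything pushed so far (cf. InterlockedSListFlush).
        with _pending_lock:
            batch = _pending[:]
            del _pending[:]
        return batch

    @contextlib.contextmanager
    def events():
        # Timeout and empty-list handling deliberately elided, as in the description.
        yield _flush_completions()

    # Consumer side, as in the example above:
    #     with events() as batch:
    #         for event in batch:
    #             ...  # process event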

On 28/11/2012 7:23pm, Trent Nelson wrote:
That API is fairly similar to what is in the proactor branch of tulip, where you can write

    for event in proactor.poll(timeout):
        # process event

But why use a thread pool just to take items from one thread-safe (FIFO) queue and put them onto another thread-safe (LIFO) queue? -- Richard

On Wed, Nov 28, 2012 at 11:57:29AM -0800, Richard Oudkerk wrote:
I'm not sure how "thread pool" got all the focus suddenly. That's just an implementation detail. The key thing I'm proposing is that we reduce the time involved in processing incoming IO requests.

Let's ignore everything pre-Vista for the sake of example. From Vista onwards, we don't even need to call GetQueuedCompletionStatus; we simply tell the new thread pool APIs which C function to invoke upon an incoming event. This C function should do as little as possible, and should have as small a footprint as possible. So, no calling CPython, no GIL acquisition. It literally just processes completed events, copying data where necessary, then does an interlocked list push of the results, and that's it, done.

Now, on XP, AIX and Solaris, we'd manually have a little thread pool, and each thread would wait on GetQueuedCompletionStatus(Ex) or port_get(). That's really the only difference; the main method body would be identical to what Windows automatically invokes via the thread pool approach from Vista onwards. Trent.
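
As a rough Python analogue of the manual thread-pool variant described for XP, AIX and Solaris (the real code would be C, with no GIL involvement and an interlocked push instead of a deque), reusing the poll_iocp() ctypes sketch from earlier in the thread:

    import threading
    from collections import deque

    completed = deque()                # stand-in for the interlocked SList

    def worker(iocp):
        # Each pool thread blocks on the completion port, copies out the bare
        # minimum, pushes the result, and goes straight back to waiting.
        while True:
            for entry in poll_iocp(iocp, timeout_ms=1000):
                completed.appendleft((entry.lpCompletionKey,
                                      entry.dwNumberOfBytesTransferred))

    def start_pool(iocp, nthreads=4):
        for _ in range(nthreads):
            threading.Thread(target=worker, args=(iocp,), daemon=True).start()

    def flush_completed():
        # Main-thread side: detach whatever the workers have pushed so far.
        batch = []
        while completed:
            batch.append(completed.pop())
        return batch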

On Wed, 28 Nov 2012 15:18:19 -0500, Trent Nelson <trent@snakebite.org> wrote:
At this point, I propose you start writing some code and come back with benchmark numbers, before claiming that your proposal improves performance at all. Further speculating about thread pools, async APIs and whatnot sounds completely useless to me. Regards, Antoine.

On 29.11.2012 13:24, Trent Nelson wrote:
I'd also like to compare with a single-threaded design where the Python code calls GetQueuedCompletionStatusEx with a time-out. The idea here is an initial busy-wait with immediate time-out without releasing the GIL. Then after e.g. 2 ms we release the GIL and do a longer wait. That should also avoid excessive GIL shifting with "64k tasks". Personally I don't think a thread-pool will add to the scalability as long as the Python code just runs on a single core. Sturla
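
A minimal sketch of that two-phase wait, again reusing the poll_iocp() ctypes sketch from earlier in the thread. Note that ctypes releases the GIL around foreign calls, so this only illustrates the control flow, not the precise GIL strategy described; the 2 ms spin and 10 ms block are the kind of values mentioned, not tuned numbers.

    import time

    def wait_for_events(iocp, spin_seconds=0.002, block_ms=10):
        # Phase 1: busy-wait with an immediate time-out.
        deadline = time.monotonic() + spin_seconds
        while time.monotonic() < deadline:
            batch = poll_iocp(iocp, timeout_ms=0)
            if batch:
                return batch
        # Phase 2: give up spinning and do a longer, blocking wait.
        return poll_iocp(iocp, timeout_ms=block_ms)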

participants (4)
- Antoine Pitrou
- Richard Oudkerk
- Sturla Molden
- Trent Nelson