The async API of the future
Work priorities don't allow me to spend another day replying in detail to the various emails on this topic, but I am still keeping up reading!

I have read Greg's response to my comparison between Future+yield-based coroutines and his yield-from-based, Future-free coroutines, and after having written a small prototype, I am now pretty much convinced that Greg's way is superior. This doesn't mean you can't use generators or yield-from for other purposes! It's just that *if* you are writing a coroutine for use with a certain scheduler, you must use yield and yield-from in accordance with the scheduler's rules. However, code you call can still use yield and yield-from for iteration, and you can still use for-loops. In particular, if f is a coroutine, it can still write "for x in g(): ..." where g is a generator meant to be an iterator. However if g were instead a coroutine, f should call it using "yield from g()", and f and g should agree on the interface of their scheduler.

As to other topics, my current feeling is that we should try to separately develop requirements and prototype implementations of the I/O loop of the future, and to figure out the loosest possible coupling between that and a coroutine scheduler (or any other type of scheduler). In particular, I think the I/O loop should not assume the event handlers are implemented using coroutines -- but if someone wants to write an awesome coroutine scheduler, they should be able to delegate all their I/O waiting needs to the I/O loop with very little trouble. To me, this means that the I/O loop probably should use "plain" callback functions (i.e., not Futures, Deferreds or coroutines).

We should also standardize the interface to the I/O loop so that 3rd parties can plug in their own I/O loop -- I don't see an end to the debate whether the best C library for event handling is libevent, libev or libuv.

While the focus of the I/O loop should be on single-threaded event handling, some standard interface should exist so that you can run certain code in a separate thread and wait for its completion -- I've found this handy when calling socket.getaddrinfo(), which may block. (Apparently async DNS lookups are really hard -- I read some complaints about libevent's DNS lookups, and IIUC many Firefox lockups are due to this.) But there may be other uses for this too.

An issue in the design of the I/O loop is the strain between a ready-based and completion-based design. The typical Unix design (whether based on select or any of the poll variants) is usually ready-based; but on Windows, the only way to get high performance is to base it on IOCP, which is completion-based (i.e. you start a specific async operation, like writing N bytes, and the I/O loop tells you when it is done). I would like people to be able to write fast event handling programs on Windows too, and ideally the only change would be the implementation of the I/O loop. But I don't know how tenable that is given the dramatically different style used by IOCP and the need to use native Windows API for all async I/O -- it sounds like we could only do this if the library providing the I/O loop implementation also wrapped all I/O operations, and that may be a bit much.

Finally, there should also be some minimal interface so that multiple I/O loops can interact -- at least in the case where one I/O loop belongs to a GUI library. It seems this is a solved problem (as well solved as you can hope for) for Twisted, so we should just adopt their approach.

-- --Guido van Rossum (python.org/~guido)
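To make the convention described above concrete, here is a minimal sketch; the scheduler object and its sleep() call are hypothetical stand-ins, not part of any existing library:

    def squares(n):                # a plain generator, used as an iterator
        for i in range(n):
            yield i * i

    def worker(sched):             # a coroutine written for some scheduler
        total = 0
        for x in squares(10):      # ordinary for-loop iteration still works
            total += x
        yield sched.sleep(0.1)     # yields follow whatever the scheduler's rules are
        return total

    def main(sched):
        result = yield from worker(sched)   # coroutine calls coroutine via yield from
        print(result)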
Guido van Rossum wrote:
I would like people to be able to write fast event handling programs on Windows too, ... But I don't know how tenable that is given the dramatically different style used by IOCP and the need to use native Windows API for all async I/O -- it sounds like we could only do this if the library providing the I/O loop implementation also wrapped all I/O operations, and that may be a bit much.
That's been bothering me, too. It seems like an interface accommodating the completion-based style will have to be *extremely* fat. That's not just a burden for anyone implementing the interface, it's a problem for any library wanting to *wrap* it as well.

For example, to maintain separation between the async layer and the generator layer, we will probably want to have an AsyncSocket object in the async layer, and a separate GeneratorSocket in the generator layer that wraps an AsyncSocket.

If the AsyncSocket needs to provide methods for all the possible I/O operations that one might want to perform on a socket, then GeneratorSocket needs to provide its own versions of all those methods as well.

Multiply that by the number of different kinds of I/O objects (files, sockets, message queues, etc. -- there seem to be quite a lot of them on Windows) and that's a *lot* of stuff to be wrapped.
Finally, there should also be some minimal interface so that multiple I/O loops can interact -- at least in the case where one I/O loop belongs to a GUI library.
That's another thing that worries me. With a ready-based event loop, this is fairly straightforward. If you can get hold of the file descriptor or handle that the GUI is ultimately reading its input from, all you need to do is add it as an event source to your main loop, and when it's ready, tell the GUI event loop to run itself once. But you can't do that with a completion-based main loop, because the actual reading of the input needs to be done in a different way, and that's usually buried somewhere deep in the GUI library where you can't easily change it.
It seems this is a solved problem (as well solved as you can hope for) to Twisted, so we should just adopt their approach.
Do they actually do it for an IOCP-based main loop on Windows? If so, I'd be interested to know how. -- Greg
On Fri, Oct 19, 2012 at 8:33 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote: ... snip ...
That's another thing that worries me. With a ready-based event loop, this is fairly straightforward. If you can get hold of the file descriptor or handle that the GUI is ultimately reading its input from, all you need to do is add it as an event source to your main loop, and when it's ready, tell the GUI event loop to run itself once.
For most windowing systems, this isn't true. You need to call some function to check if you have events pending. For X11, this is "XPending". For Win32, this is "GetQueueStatus". But overall, the thing is that most GUI libraries have their own event loops. In GTK+, this is done with a "GSource", which can have support for custom sources (which is how the calls to the above APIs are made).

What Twisted does in this case is swap out their own select loop with another implementation built around GLib's GMainLoop, which uses whatever internally. I'd highly recommend taking Twisted's approach of having swappable event loops. The question then becomes how you swap out the main loop: Twisted does this with a global reactor which you "install", which the community has found rather ugly, but there isn't really a better solution they've come up with. They've had a few proposals over the years to add better functionality, so I'd like to hear their experience on this.
-- Jasper
Jasper St. Pierre wrote:
For most windowing systems, this isn't true. You need to call some function to check if you have events pending. For X11, this is "XPending". For Win32, this is "GetQueueStatus".
X11 is ultimately reading its events from the socket to the display server. If you select() that socket, it will tell you whenever the X11 event loop could possibly have something to do. On Windows, I imagine the equivalent would be to pass your message queue handle to a WaitForMultipleObjects call. I've never tried to do anything like that, though, so I don't know if it would really work.
What Twisted does in this case is swap out their own select loop with another implementation built around GLib's GMainLoop,
If it's truly impossible to incorporate GMainLoop as a sub-loop of something else, then this is a bad situation. What happens if you also want to use some other library that insists on *its* main loop being in charge? This cannot be a general solution. -- Greg
On Fri, Oct 19, 2012 at 11:11 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Jasper St. Pierre wrote:
For most windowing systems, this isn't true. You need to call some function to check if you have events pending. For X11, this is "XPending". For Win32, this is "GetQueueStatus".
X11 is ultimately reading its events from the socket to the display server. If you select() that socket, it will tell you whenever the X11 event loop could possibly have something to do.
Nope. libX11/XCB keep their own queue of events and do their own socket management, so it's not just "poll on this FD, thanks" http://cgit.freedesktop.org/xorg/lib/libX11/tree/src/Pending.c http://cgit.freedesktop.org/xorg/lib/libX11/tree/src/xcb_io.c#n344
On Windows, I imagine the equivalent would be to pass your message queue handle to a WaitForMultipleObjects call. I've never tried to do anything like that, though, so I don't know if it would really work.
What Twisted does in this case is swap out their own select loop with another implementation built around GLib's GMainLoop,
If it's truly impossible to incorporate GMainLoop as a sub-loop of something else, then this is a bad situation. What happens if you also want to use some other library that insists on *its* main loop being in charge? This cannot be a general solution.
GLib has a way of embedding its main loop in another, but it's not easy or viable to use in a situation like this. It basically splits up its event loop into multiple pieces (prepare, check, dispatch), which you call at various times. Qt uses this for their GLib mainloop integration. It's clear there's never going to be one event loop solution (as Guido already mentioned, there's wars about libuv/libevent/libev that we can't possibly resolve), so why pretend like there is?
-- Jasper
Jasper St. Pierre wrote:
Nope. libX11/XCB keep their own queue of events and do their own socket management, so it's not just "poll on this FD, thanks"
So you keep going until the internal buffer is empty. "Run once" is probably a bit inaccurate; it's really more like "run until you don't think there's anything more to do".
It's clear there's never going to be one event loop solution (as Guido already mentioned, there's wars about libuv/libevent/libev that we can't possibly resolve), so why pretend like there is?
This discussion seems to have got off track. I'm not opposed to being able to choose whichever top-level event loop works the best for your application. All I set out to say is that a wait-for-ready style event loop seems more amenable to having other event loops plugged into it than a wait-for-completion one. But maybe that's not a problem if we provide an IOCP-based event loop that can be plugged into the wait-for-ready loop of your choice. Is that likely to be feasible? -- Greg
On 20/10/2012 1:33am, Greg Ewing wrote:
That's been bothering me, too. It seems like an interface accommodating the completion-based style will have to be *extremely* fat.
That's not just a burden for anyone implementing the interface, it's a problem for any library wanting to *wrap* it as well.
For example, to maintain separation between the async layer and the generator layer, we will probably want to have an AsyncSocket object in the async layer, and a separate GeneratorSocket in the generator layer that wraps an AsyncSocket.
If the AsyncSocket needs to provide methods for all the possible I/O operations that one might want to perform on a socket, then GeneratorSocket needs to provide its own versions of all those methods as well.
Multiply that by the number of different kinds of I/O objects (files, sockets, message queues, etc. -- there seem to be quite a lot of them on Windows) and that's a *lot* of stuff to be wrapped.
I don't see why a completion api needs to create wrappers for sockets. See

    http://pastebin.com/7tDmeYXz

for an implementation of a completion api implemented for Unix (plus a stupid reactor class and some example server/client code).

The AsyncIO class is independent of reactors, futures etc. The methods for starting an operation are

    recv(key, sock, nbytes, flags=0)
    send(key, sock, buf, flags=0)
    accept(key, sock)
    connect(key, sock, address)

The "key" argument is used as an identifier for the operation. You wait for something to complete using

    wait(timeout=None)

which returns a list of tuples of the form "(key, success, value)" representing completed operations. "key" is the identifier used when starting the operation, "success" is a boolean indicating whether an error occurred, and "value" is the return/exception value. To check whether there are any outstanding operations, use

    empty()

(To make the AsyncIO class usable without a reactor one should probably implement a "filtered" wait so that you can restrict the keys you want to wait for.)

-- Richard
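As a rough sketch of how the interface described above might be used -- the echo logic, the key tuples, and the assumption that accept's result value is a (conn, address) pair are illustrative guesses, not taken from the pastebin; listener is assumed to be a bound, listening socket:

    aio = AsyncIO()
    aio.accept("acc", listener)                  # start an asynchronous accept
    while not aio.empty():
        for key, success, value in aio.wait(timeout=1.0):
            if not success:
                print("operation", key, "failed:", value)
            elif key == "acc":
                conn, addr = value               # assumed: accept yields (conn, address)
                aio.recv(("read", conn), conn, 4096)
                aio.accept("acc", listener)      # keep accepting new connections
            elif key[0] == "read":
                _, conn = key
                aio.send(("write", conn), conn, value)   # echo the received data back
            # ("write", conn) completions need no further action here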
Richard Oudkerk wrote:
I don't see why a completion api needs to create wrappers for sockets. See
...
The AsyncIO class is independent of reactors, futures etc. The methods for starting an operation are
recv(key, sock, nbytes, flags=0)
send(key, sock, buf, flags=0)
accept(key, sock)
connect(key, sock, address)
That looks awfully like a wrapper for a socket to me. All of those system calls are peculiar to sockets. There doesn't necessarily have to be a wrapper class for each kind of file descriptor. There could be one I/O class that handles everything, or there could just be a collection of functions. The point is that, with a completion-based model, you need a function or method for every possible system call that you might want to perform asynchronously. -- Greg
On Sat, Oct 20, 2012 at 4:41 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
The point is that, with a completion-based model, you need a function or method for every possible system call that you might want to perform asynchronously.
TBH, I like APIs that wrap all system calls. System calls have too many low-level details that you have to be aware of, and they too often vary per platform. (I just wrote a simple event loop plus scheduler along the lines of your essay, extending it to the point where I could do basic, fully-async, HTTP exchanges. The number of details I had to take care of was excruciating; and then there were the subtle differences between OSX and Ubuntu.) -- --Guido van Rossum (python.org/~guido)
On Oct 20, 2012, at 4:53 PM, Guido van Rossum <guido at python.org> wrote:
On Sat, Oct 20, 2012 at 4:41 PM, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
The point is that, with a completion-based model, you need a function or method for every possible system call that you might want to perform asynchronously.
TBH, I like APIs that wrap all system calls. System calls have too many low-level details that you have to be aware of, and they too often vary per platform. (I just wrote a simple event loop plus scheduler along the lines of your essay, extending it to the point where I could do basic, fully-async, HTTP exchanges. The number of details I had to take care of was excruciating; and then there were the subtle differences between OSX and Ubuntu.)
The layer that wraps the system calls does not necessarily have to be visible to applications. You absolutely need the syscalls to be exposed directly at some lower, non-standardized level, because it takes on average 15 years to shake out all the differences between platform behavior that you observed here :-). If applications try to do this, they will always get it wrong, and besides, they want to be making different syscalls for different transports. Much of Twisted's development has been about discovering exciting new behaviors on new platforms or new versions of supported platforms in the face of new levels of load, concurrency, or some other attribute.

(A minor nitpick: system calls aren't usually performed asynchronously; you execute the syscall non-blockingly, and then you complete the action asynchronously. The whole idea of asynchronous I/O via non-blocking APIs implies some level of syscall wrapping.)

-glyph
On Sun, 21 Oct 2012 12:41:41 +1300 Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Richard Oudkerk wrote:
I don't see why a completion api needs to create wrappers for sockets. See
...
The AsyncIO class is independent of reactors, futures etc. The methods for starting an operation are
recv(key, sock, nbytes, flags=0)
send(key, sock, buf, flags=0)
accept(key, sock)
connect(key, sock, address)
That looks awfully like a wrapper for a socket to me. All of those system calls are peculiar to sockets.
There doesn't necessarily have to be a wrapper class for each kind of file descriptor. There could be one I/O class that handles everything, or there could just be a collection of functions.
The point is that, with a completion-based model, you need a function or method for every possible system call that you might want to perform asynchronously.
There aren't that many of them, though: the four Richard listed should already be enough for most network applications, AFAIK. I really think Richard's proposal is a sane building block. Regards Antoine.
Den 19. okt. 2012 kl. 18:05 skrev Guido van Rossum <guido@python.org>:
An issue in the design of the I/O loop is the strain between a ready-based and completion-based design. The typical Unix design (whether based on select or any of the poll variants) is usually ready-based; but on Windows, the only way to get high performance is to base it on IOCP, which is completion-based (i.e. you start a specific async operation, like writing N bytes, and the I/O loop tells you when it is done). I would like people to be able to write fast event handling programs on Windows too, and ideally the only change would be the implementation of the I/O loop. But I don't know how tenable that is given the dramatically different style used by IOCP and the need to use native Windows API for all async I/O -- it sounds like we could only do this if the library providing the I/O loop implementation also wrapped all I/O operations, and that may be a bit much.
Not really, no.

IOCP might be the easiest way to get high performance on Windows, but certainly not the only.

IOCP is a simple user-space wrapper for a thread-pool and overlapped (i.e. asynchronous) i/o. There is nothing IOCP can do that cannot be done with a pool of threads and non-blocking read or write operations.

Windows certainly has a function to select among multiple wait objects, called WaitForMultipleObjects. If open files are associated with event objects signalling "ready-to-read" or "ready-to-write", that is the basic machinery of an Unix select() function.

Then the problem is polling for "ready-to-read" and "ready-to-write". The annoying part is that different types of files (disk files, sockets, pipes, named pipes, hardware devices) must be polled with different Windows API calls – but there are non-blocking calls to poll them all. For this reason, Cygwin's select function spawns one thread to poll each type of file. Threads are very cheap on Windows, and polling loops can use Sleep(0) to release the remainder of their time-slice, so this kind of polling is not very expensive. However, if we use a thread-pool for the polling, instead of spawning new threads on each call to select, we would be doing more or less the same as Windows built-in IOCPs, except we are signalling "ready" instead of "finished".

Thus, I think it is possible to get high performance without IOCP. But Microsoft has only implemented a select call for sockets. My suggestion would be to forget about IOCP and implement select for more than just sockets on Windows. The reason for this is that select and IOCP are signalling on different sides of the I/O operation (ready vs. completed). So programs based on select and IOCP tend to have opposite logics with respect to scheduling I/O. And as the general trend today is to develop for Unix and then port to Windows (as most programmers find the Windows API annoying), I think it would be better to port select (and perhaps poll and epoll) to Windows than provide IOCP to Python.

Sturla
On Fri, 2 Nov 2012 22:29:09 +0100 Sturla Molden <sturla@molden.no> wrote:
IOCP might be the easiest way to get high performance on Windows, but certainly not the only.
IOCP is a simple user-space wrapper for a thread-pool and overlapped (i.e. asynchronous) i/o. There is nothing IOCP can do that cannot be done with a pool of threads and non-blocking read or write operations.
Windows certainly has a function to select among multiple wait objects, called WaitForMultipleObjects. If open files are associated with event objects signalling "ready-to-read" or "ready-to-write", that is the basic machinery of an Unix select() function.
Hmm, but the basic problem with WaitForMultipleObjects is that it has a hard limit of 64 objects you can wait on. Regards Antoine.
Den 2. nov. 2012 kl. 23:14 skrev Antoine Pitrou <solipsis@pitrou.net>:
On Fri, 2 Nov 2012 22:29:09 +0100 Sturla Molden <sturla@molden.no> wrote:
IOCP might be the easiest way to get high performance on Windows, but certainly not the only.
IOCP is a simple user-space wrapper for a thread-pool and overlapped (i.e. asynchronous) i/o. There is nothing IOCP can do that cannot be done with a pool of threads and non-blocking read or write operations.
Windows certainly has a function to select among multiple wait objects, called WaitForMultipleObjects. If open files are associated with event objects signalling "ready-to-read" or "ready-to-write", that is the basic machinery of an Unix select() function.
Hmm, but the basic problem with WaitForMultipleObjects is that it has a hard limit of 64 objects you can wait on.
So you nest them in a tree, each node having up to 64 children... The root allows us to wait for 64 objects, the first branch allows us to wait for 4096, and the second 262144... For example, if 4096 wait objects are enough, we can use a pool of 64 threads. Each thread calls WaitForMultipleObjects on up to 64 wait objects, and signals to the master when it wakes up. Sturla
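A sketch of that fan-out in Python-ish pseudocode; wait_any() is a hypothetical stand-in for a WaitForMultipleObjects call on up to 64 handles and is not a real Python API:

    import queue
    import threading

    signalled = queue.Queue()

    def waiter(handles):
        idx = wait_any(handles)        # hypothetical WaitForMultipleObjects wrapper
        signalled.put(handles[idx])    # tell the master which handle fired

    def wait_tree(handles):            # covers up to 64 * 64 = 4096 objects
        for i in range(0, len(handles), 64):
            chunk = handles[i:i + 64]
            threading.Thread(target=waiter, args=(chunk,), daemon=True).start()
        return signalled.get()         # first handle signalled by any worker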
On 02/11/2012 11:10pm, Sturla Molden wrote:
So you nest them in a tree, each node having up to 64 children...
The root allows us to wait for 64 objects, the first branch allows us to wait for 4096, and the second 262144...
For example, if 4096 wait objects are enough, we can use a pool of 64 threads. Each thread calls WaitForMultipleObjects on up to 64 wait objects, and signals to the master when it wakes up.
Windows already has RegisterWaitForSingleObject() which basically does what you describe: http://msdn.microsoft.com/en-gb/library/windows/desktop/ms685061%28v=vs.85%2... -- Richard.
Den 3. nov. 2012 kl. 01:32 skrev Richard Oudkerk <shibturn@gmail.com>:
On 02/11/2012 11:10pm, Sturla Molden wrote:
So you nest them in a tree, each node having up to 64 children...
The root allows us to wait for 64 objects, the first branch allows us to wait for 4096, and the second 262144...
For example, if 4096 wait objects are enough, we can use a pool of 64 threads. Each thread calls WaitForMultipleObjects on up to 64 wait objects, and signals to the master when it wakes up.
Windows already has RegisterWaitForSingleObject() which basically does what you describe:
http://msdn.microsoft.com/en-gb/library/windows/desktop/ms685061%28v=vs.85%2...
No, it does something completely different. It registers a callback function for a single event object and waits. We were talking about multiplexing a wait for more than 64 objects. Sturla
On 03/11/2012 9:22am, Sturla Molden wrote:
No, it does something completely different. It registers a callback function for a single event object and waits. We were talking about multiplexing a wait for more than 64 objects.
By using an appropriate callback you can easily implement something like WaitForMultipleObjects() which does not have the 64 handle limit (without having to explicitly start any threads). More usefully, if the callback posts a message to an IOCP then it lets you use the IOCP to wait on non-IO things. -- Richard
Den 3. nov. 2012 kl. 12:20 skrev Richard Oudkerk <shibturn@gmail.com>:
On 03/11/2012 9:22am, Sturla Molden wrote:
No, it does something completely different. It registers a callback function for a single event object and waits. We were talking about multiplexing a wait for more than 64 objects.
By using an appropriate callback you easily implement something like WaitForMultipleObjects() which does not have the 64 handle limit (without having to explicitly start any threads).
But it uses a thread-pool that polls the registered wait objects, so the overhead (with respect to latency) will still be O(n). It does not matter if you ask Windows to allocate a thread-pool for the polling or if you do the polling yourself. It is still user-space threads that poll N objects with O(n) complexity. But if you nest WaitForMultipleObjects, you can get the latency down to O(log n).

IOCP is just an abstraction for a thread-pool and a FIFO. If you want to use a thread-pool and a FIFO to wait for something other than I/O there are easier ways. For example, you can use the queue functions in NT6 and enqueue whatever APC you want – or just use a list of threads and a queue in Python.

Sturla
Sturla Molden wrote:
But it uses a thread-pool that polls the registered wait objects, so the overhead (with respect to latency) will still be O(n).
I'm not sure exactly what you mean by "polling" here. I'm pretty sure that *none* of the mechanisms we're talking about here (select, poll, kqueue, IOCP, WaitForMultipleWhatever, etc) indulge in busy-waiting while looping over the relevant handles. They all ultimately make use of hardware interrupts to wake up a thread when something interesting happens. The scaling issue, as I understand it, is that select() and WaitForMultipleObjects() require you to pass in the entire list of fds or handles on every call, so that there is an O(n) setup cost every time you wait. A more scaling-friendly API would let you pre-register the set of interesting objects, so that the actual waiting call is O(1). I believe this is the reason things like epoll, kqueue and IOCP are considered more scalable. -- Greg
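In stdlib terms the difference looks roughly like this (a sketch: read_fds, write_fds and timeout stand for whatever state the loop is tracking, and select.epoll is Linux-only):

    import select

    # select(): the full fd lists are passed on *every* call -- O(n) setup each time
    readable, writable, _ = select.select(read_fds, write_fds, [], timeout)

    # epoll: interest is registered once up front; each wait only reports ready fds
    ep = select.epoll()
    for fd in read_fds:
        ep.register(fd, select.EPOLLIN)
    events = ep.poll(timeout)          # list of (fd, eventmask) for ready fds only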
On Sat, Nov 3, 2012 at 4:49 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Sturla Molden wrote:
But it uses a thread-pool that polls the registered wait objects, so the overhead (with respect to latency) will still be O(n).
I'm not sure exactly what you mean by "polling" here. I'm pretty sure that *none* of the mechanisms we're talking about here (select, poll, kqueue, IOCP, WaitForMultipleWhatever, etc) indulge in busy-waiting while looping over the relevant handles. They all ultimately make use of hardware interrupts to wake up a thread when something interesting happens.
The scaling issue, as I understand it, is that select() and WaitForMultipleObjects() require you to pass in the entire list of fds or handles on every call, so that there is an O(n) setup cost every time you wait.
A more scaling-friendly API would let you pre-register the set of interesting objects, so that the actual waiting call is O(1). I believe this is the reason things like epoll, kqueue and IOCP are considered more scalable.
I've been thinking about this too. I can see the scalability issues with select(), but frankly, poll(), epoll(), and even kqueue() all look similar in O() behavior to me from an API perspective. I guess the differences are in the kernel -- but is it a constant factor or an unfortunate O(N) or worse? To what extent would this be overwhelmed by overhead in the Python code we're writing around it? How bad is it to add extra register()/unregister() (or modify()) calls per read operation? -- --Guido van Rossum (python.org/~guido)
On Sun, Nov 4, 2012 at 7:26 AM, Guido van Rossum <guido@python.org> wrote:
I've been thinking about this too. I can see the scalability issues with select(), but frankly, poll(), epoll(), and even kqueue() all look similar in O() behavior to me from an API perspective. I guess the differences are in the kernel -- but is it a constant factor or an unfortunate O(N) or worse? To what extent would this be overwhelmed by overhead in the Python code we're writing around it? How bad is it to add extra register()/unregister() (or (modify()) calls per read operation?
The extra system calls add up. The interface of Tornado's IOLoop was based on epoll (where the internal state is roughly a mapping {fd: event_set}), so it requires more register/unregister operations when running on kqueue (where the internal state is roughly a set of (fd, event) pairs). This shows up in benchmarks of the HTTPServer; it's faster on platforms with epoll than platforms with kqueue. In low-concurrency scenarios it's actually faster to use select() even when kqueue is available (or maybe that's a mac-specific quirk). -Ben
On Sun, Nov 4, 2012 at 8:11 AM, Ben Darnell <ben@bendarnell.com> wrote:
The extra system calls add up. The interface of Tornado's IOLoop was based on epoll (where the internal state is roughly a mapping {fd: event_set}), so it requires more register/unregister operations when running on kqueue (where the internal state is roughly a set of (fd, event) pairs). This shows up in benchmarks of the HTTPServer; it's faster on platforms with epoll than platforms with kqueue. In low-concurrency scenarios it's actually faster to use select() even when kqueue is available (or maybe that's a mac-specific quirk).
Awesome info! -- --Guido van Rossum (python.org/~guido)
On 11/4/12 8:11 AM, Ben Darnell wrote:
The extra system calls add up. The interface of Tornado's IOLoop was based on epoll (where the internal state is roughly a mapping {fd: event_set}), so it requires more register/unregister operations when running on kqueue (where the internal state is roughly a set of (fd, event) pairs). This shows up in benchmarks of the HTTPServer; it's faster on platforms with epoll than platforms with kqueue. In low-concurrency scenarios it's actually faster to use select() even when kqueue is available (or maybe that's a mac-specific quirk).
Just so I have this right, you're saying that HTTPServer is slower on kqueue because of the IOLoop design, yes? I've just looked over the epoll interface and I see at least one huge difference compared to kqueue: it requires a system call for each fd registration event. With kevent() you can accumulate thousands of registrations, shove them into a single kevent() call and get thousands of events out. It's a little all-singing-all-dancing, but it's hard to imagine a way to do it using fewer system calls. 8^) -Sam
On Mon, Nov 5, 2012 at 11:30 AM, Sam Rushing < sam-pydeas@rushing.nightmare.com> wrote:
On 11/4/12 8:11 AM, Ben Darnell wrote:
The extra system calls add up. The interface of Tornado's IOLoop was based on epoll (where the internal state is roughly a mapping {fd: event_set}), so it requires more register/unregister operations when running on kqueue (where the internal state is roughly a set of (fd, event) pairs). This shows up in benchmarks of the HTTPServer; it's faster on platforms with epoll than platforms with kqueue. In low-concurrency scenarios it's actually faster to use select() even when kqueue is available (or maybe that's a mac-specific quirk).
Just so I have this right, you're saying that HTTPServer is slower on kqueue because of the IOLoop design, yes?
Yes. When the server processes a request and switches from listening for readability to listening for writability, with epoll it's one call directly into the C module to set the event mask for the socket. With kqueue something in the IOLoop must store the previous state and generate the two separate actions to remove the read listener and add a write listener. I misspoke when I mentioned system call; the difference is actually the amount of python code that must be run to call the right C functions. This would get a lot better if more of the IOLoop were written in C.
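Concretely, with the stdlib select module, the switch Ben describes looks something like the sketch below; ep, kq and sock are assumed to be an existing epoll object, kqueue object and already-registered socket:

    import select

    # epoll: a single call rewrites the event mask for the fd
    ep.modify(sock.fileno(), select.EPOLLOUT)

    # kqueue: remove the read filter and add a write filter -- two changes
    kq.control([
        select.kevent(sock.fileno(), select.KQ_FILTER_READ, select.KQ_EV_DELETE),
        select.kevent(sock.fileno(), select.KQ_FILTER_WRITE, select.KQ_EV_ADD),
    ], 0)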
I've just looked over the epoll interface and I see at least one huge difference compared to kqueue: it requires a system call for each fd registration event. With kevent() you can accumulate thousands of registrations, shove them into a single kevent() call and get thousands of events out. It's a little all-singing-all-dancing, but it's hard to imagine a way to do it using fewer system calls. 8^)
True, although whenever I've tried to be clever and batch up kevent calls I haven't gotten the performance I'd hoped for because system calls aren't actually that expensive in comparison to python opcodes. Also at least some versions of Mac OS have a bug where you can only pass one event at a time. -Ben
On 11/6/12 7:41 AM, Ben Darnell wrote:
Yes. When the server processes a request and switches from listening for readability to listening for writability, with epoll it's one call directly into the C module to set the event mask for the socket. With kqueue something in the IOLoop must store the previous state and generate the two separate actions to remove the read listener and add a write listener.
Does that mean you're not using EV_ONESHOT?
I misspoke when I mentioned system call; the difference is actually the amount of python code that must be run to call the right C functions. This would get a lot better if more of the IOLoop were written in C.
That's what we did with shrapnel, though we split the difference and wrote everything in Pyrex.
True, although whenever I've tried to be clever and batch up kevent calls I haven't gotten the performance I'd hoped for because system calls aren't actually that expensive in comparison to python opcodes.
And yeah, of course all this is dominated by time in the python VM... Also, you still have to execute all the read/write system calls, so it only cuts it in half. -Sam
On 04/11/12 15:26, Guido van Rossum wrote:
I've been thinking about this too. I can see the scalability issues with select(), but frankly, poll(), epoll(), and even kqueue() all look similar in O() behavior to me from an API perspective. I guess the differences are in the kernel -- but is it a constant factor or an unfortunate O(N) or worse? To what extent would this be overwhelmed by overhead in the Python code we're writing around it? How bad is it to add extra register()/unregister() (or (modify()) calls per read operation?
At the C level poll() and epoll() have quite different APIs. Each time you use poll() you have to pass an array which describes the events you are interested in. That is not necessary with epoll(). The python API hides the difference. -- Richard
I came upon a set of blog posts today that some folks who are tracking this async discussion might find interesting for further research and ideas. http://blog.incubaid.com/2012/04/02/tracking-asynchronous-io-using-type-syst... -Kevin
On Sat, Nov 3, 2012 at 9:10 AM, Sturla Molden <sturla@molden.no> wrote:
The root allows us to wait for 64 objects, the first branch allows us to wait for 4096, and the second 262144...
For example, if 4096 wait objects are enough, we can use a pool of 64 threads. Each thread calls WaitForMultipleObjects on up to 64 wait objects, and signals to the master when it wakes up.
Given that the purposes of using async IO is to improve scalability on a single machine by a factor of 100 or more beyond what is typically possible with threads or processes, hard capping the scaling improvement on Windows at 64x the thread limit by relying on WaitForMultipleObjects seems to be rather missing the point. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Den 3. nov. 2012 kl. 10:35 skrev Nick Coghlan <ncoghlan@gmail.com>:
On Sat, Nov 3, 2012 at 9:10 AM, Sturla Molden <sturla@molden.no> wrote:
The root allows us to wait for 64 objects, the first branch allows us to wait for 4096, and the second 262144...
For example, if 4096 wait objects are enough, we can use a pool of 64 threads. Each thread calls WaitForMultipleObjects on up to 64 wait objects, and signals to the master when it wakes up.
Given that the purposes of using async IO is to improve scalability on a single machine by a factor of 100 or more beyond what is typically possible with threads or processes, hard capping the scaling improvement on Windows at 64x the thread limit by relying on WaitForMultipleObjects seems to be rather missing the point.
The only thread limitation on Windows 64 is the amount of RAM. IOCPs are also thread-based (they are actually user-space thread pools). Sturla
Den 2. nov. 2012 kl. 23:14 skrev Antoine Pitrou <solipsis@pitrou.net>:
On Fri, 2 Nov 2012 22:29:09 +0100 Sturla Molden <sturla@molden.no> wrote:
IOCP might be the easiest way to get high performance on Windows, but certainly not the only.
IOCP is a simple user-space wrapper for a thread-pool and overlapped (i.e. asynchronous) i/o. There is nothing IOCP can do that cannot be done with a pool of threads and non-blocking read or write operations.
Windows certainly has a function to select among multiple wait objects, called WaitForMultipleObjects. If open files are associated with event objects signalling "ready-to-read" or "ready-to-write", that is the basic machinery of an Unix select() function.
Hmm, but the basic problem with WaitForMultipleObjects is that it has a hard limit of 64 objects you can wait on.
Or a simpler solution than nesting them into a tree: Let the calls to WaitForMultipleObjects time out at once, and loop over as many events as you need, polling 64 event objects simultaneously. At the end of the loop, call Sleep(0) to avoid burning the CPU. A small number of threads could also be used to run this loop in parallel. Sturla
On Sat, 3 Nov 2012 00:21:43 +0100 Sturla Molden <sturla@molden.no> wrote:
Den 2. nov. 2012 kl. 23:14 skrev Antoine Pitrou <solipsis@pitrou.net>:
On Fri, 2 Nov 2012 22:29:09 +0100 Sturla Molden <sturla@molden.no> wrote:
IOCP might be the easiest way to get high performance on Windows, but certainly not the only.
IOCP is a simple user-space wrapper for a thread-pool and overlapped (i.e. asynchronous) i/o. There is nothing IOCP can do that cannot be done with a pool of threads and non-blocking read or write operations.
Windows certainly has a function to select among multiple wait objects, called WaitForMultipleObjects. If open files are associated with event objects signalling "ready-to-read" or "ready-to-write", that is the basic machinery of an Unix select() function.
Hmm, but the basic problem with WaitForMultipleObjects is that it has a hard limit of 64 objects you can wait on.
Or a simpler solution than nesting them into a tree: Let the calls to WaitForMultipleObjects time out at once, and loop over as many events as you need, polling 64 event objects simultaneously.
Well, that's basically O(number of objects), isn't it? Regards Antoine.
Den 3. nov. 2012 kl. 00:30 skrev Antoine Pitrou <solipsis@pitrou.net>:
On Sat, 3 Nov 2012 00:21:43 +0100 Sturla Molden <sturla@molden.no> wrote:
Den 2. nov. 2012 kl. 23:14 skrev Antoine Pitrou <solipsis@pitrou.net>:
On Fri, 2 Nov 2012 22:29:09 +0100 Sturla Molden <sturla@molden.no> wrote:
IOCP might be the easiest way to get high performance on Windows, but certainly not the only.
IOCP is a simple user-space wrapper for a thread-pool and overlapped (i.e. asynchronous) i/o. There is nothing IOCP can do that cannot be done with a pool of threads and non-blocking read or write operations.
Windows certainly has a function to select among multiple wait objects, called WaitForMultipleObjects. If open files are associated with event objects signalling "ready-to-read" or "ready-to-write", that is the basic machinery of an Unix select() function.
Hmm, but the basic problem with WaitForMultipleObjects is that it has a hard limit of 64 objects you can wait on.
Or a simpler solution than nesting them into a tree: Let the calls to WaitForMultipleObjects time out at once, and loop over as many events as you need, polling 64 event objects simultaneously.
Well, that's basically O(number of objects), isn't it?
Yes, but nesting would be O(log64 n).
On Sat, 3 Nov 2012 00:50:15 +0100 Sturla Molden <sturla@molden.no> wrote:
Or a simpler solution than nesting them into a tree: Let the calls to WaitForMultipleObjects time out at once, and loop over as many events as you need, polling 64 event objects simultaneously.
Well, that's basically O(number of objects), isn't it?
Yes, but nesting would be O(log64 n).
No, you still have O(n) calls to WaitForMultipleObjects, just arranged differently. (in other words, the depth of your tree is O(log n), but its number of nodes is O(n)) Regards Antoine.
Den 3. nov. 2012 kl. 00:54 skrev Antoine Pitrou <solipsis@pitrou.net>:
Or a simpler solution than nesting them into a tree: Let the calls to WaitForMultipleObjects time out at once, and loop over as many events as you need, polling 64 event objects simultaneously.
Well, that's basically O(number of objects), isn't it?
Yes, but nesting would be O(log64 n).
No, you still have O(n) calls to WaitForMultipleObjects, just arranged differently. (in other words, the depth of your tree is O(log n), but its number of nodes is O(n))
True, but is the time latency O(n) or O(log n)? Also, from what I read, the complexity of select.poll is O(n) with respect to file handles, so this should not be any worse (O(log n) latency wait, O(n) polling), I think.

Another interesting strategy for high performance on Windows 64: Just use blocking i/o and one thread per client. The stack-space limitation is a 32-bit problem, and Windows 64 has no problem scheduling an insane number of threads. Even desktop computers today can have 16 GB of RAM, so there is virtually no limitation on the number of i/o threads Windows 64 can multiplex. But would it scale with Python threads and the GIL as well? You would be better placed to answer that.

Sturla
On Sat, 3 Nov 2012 10:37:41 +0100 Sturla Molden <sturla@molden.no> wrote:
Den 3. nov. 2012 kl. 00:54 skrev Antoine Pitrou <solipsis@pitrou.net>:
Or a simpler solution than nesting them into a tree: Let the calls to WaitForMultipleObjects time out at once, and loop over as many events as you need, polling 64 event objects simultaneously.
Well, that's basically O(number of objects), isn't it?
Yes, but nesting would be O(log64 n).
No, you still have O(n) calls to WaitForMultipleObjects, just arranged differently. (in other words, the depth of your tree is O(log n), but its number of nodes is O(n))
True, but is the time latency O(n) or O(log n)?
Right, that's the difference. However, I think here we are concerned about CPU load on the server, not individual latency (as long as it is acceptable, e.g. lower than 5 ms).
Also, from what I read, the complexity of select.poll is O(n) with respect to file handles, so this should not be any worse (O(log n) latency wait, O(n) polling) I think.
epoll and kqueue are better than O(number of objects) though.
Another interesting strategy for high-performance on Windows 64: Just use blocking i/o and one thread per client. The stack-space limitation is a 32-bit problem, and Windows 64 has no problem scheduling an insane number of threads. Even desktop computers today can have 16 GB of RAM, so there is virtually no limitation on the number of i/o threads Windows 64 can multiplex.
That's still a huge waste of RAM, isn't it? Also, by relying on preemptive threading you have to use Python's synchronization primitives (locks, etc.), and I don't know how these would scale.
But would it scale with Python threads and the GIL as well? You would be better to answer that.
I haven't done any tests with a large number of threads, but the GIL certainly has a (per-thread as well as per-context switch) overhead. Regards Antoine.
Den 3. nov. 2012 kl. 11:14 skrev Antoine Pitrou <solipsis@pitrou.net>:
True, but is the time latency O(n) or O(log n)?
Right, that's the difference. However, I think here we are concerned about CPU load on the server, not individual latency (as long as it is acceptable, e.g. lower than 5 ms).
Ok, I can do som tests on Windows :-)
Also, from what I read, the complexity of select.poll is O(n) with respect to file handles, so this should not be any worse (O(log n) latency wait, O(n) polling) I think.
epoll and kqueue are better than O(number of objects) though.
I know, they claim the wait to be about O(1). I guess that magic happens in the kernel. With IOCP on Windows there is a thread-pool that continuously polls the i/o tasks for completion. So I think IOCPs might approach O(n) at some point. I assume as long as we are staying in user-space, there will always be an O(n) overhead somewhere. To avoid it one would need the kernel to trigger callbacks from hardware interrupts, which presumably is what epoll and kqueue do. But at least on Windows, anything except "one-thread-per-client" involves O(n) polling by user-space threads (even IOCPs and RegisterWaitForSingleObject do that). The kernel only schedules threads, it does not trigger i/o callbacks from hardware. But who in their right mind uses Windows for these kinds of servers anyway?
Another interesting strategy for high-performance on Windows 64: Just use blocking i/o and one thread per client. The stack-space limitation is a 32-bit problem, and Windows 64 has no problem scheduling an insane number of threads. Even desktop computers today can have 16 GB of RAM, so there is virtually no limitation on the number of i/o threads Windows 64 can multiplex.
That's still a huge waste of RAM, isn't it?
That is depending on perspective :-) If threads are a simpler design pattern than IOCPs, the latter is a huge waste of work hours. Which is cheaper today? I think to some extent IOCPs solve a problem related to 32-bit address spaces or limited RAM. But if RAM is cheaper than programming effort, just go ahead and waste as much as you need :-) Also, those who need this kind of server can certainly afford to buy enough RAM.
Also, by relying on preemptive threading you have to use Python's synchronization primitives (locks, etc.), and I don't know how these would scale.
But would it scale with Python threads and the GIL as well? You would be better to answer that.
I haven't done any tests with a large number of threads, but the GIL certainly has a (per-thread as well as per-context switch) overhead.
That is the thing, plain Windows threads and Python threads in huge numbers might not behave similarly. It would be interesting to test. Sturla
On Sat, 3 Nov 2012 12:47:53 +0100 Sturla Molden <sturla@molden.no> wrote:
Also, from what I read, the complexity of select.poll is O(n) with respect to file handles, so this should not be any worse (O(log n) latency wait, O(n) polling) I think.
epoll and kqueue are better than O(number of objects) though.
I know, they claim the wait to be about O(1). I guess that magic happens in the kernel.
They are not O(1), they are O(number of ready objects).
With IOCP on Windows there is a thread-pool that continuously polls the i/o tasks for completion. So I think IOCPs might approach O(n) at some point.
Well, I don't know about the IOCP implementation, but "continuously polling the I/O tasks" sounds like a costly way to do it (what system call would that use?). If the kernel cooperates, no continuous polling should be required.
That is depending on perspective :-) If threads are a simpler design pattern than IOCPs, the latter is a huge waste of work hours.
Er, the whole point of this discussion is to design a library so that the developer does *not* have to deal with IOCPs. As for "simpler design pattern", I think it's mostly a matter of habit. Writing a network daemon with Twisted is not difficult. And making multi-threaded code scale properly might not be trivial, depending on the problem. Regards Antoine.
On 03.11.2012 18:22, Antoine Pitrou wrote:
With IOCP on Windows there is a thread-pool that continuously polls the i/o tasks for completion. So I think IOCPs might approach O(n) at some point.
Well, I don't know about the IOCP implementation, but "continuously polling the I/O tasks" sounds like a costly way to do it (what system call would that use?).
The polling uses the system call GetOverlappedResult, and if the task is unfinished, calls Sleep(0) to release the time-slice and poll again. Specifically, if the last argument to GetOverlappedResult is FALSE, and the return value is FALSE, we must call GetLastError to retrieve an error code. If GetLastError returns ERROR_IO_INCOMPLETE, we know that the task was not finished.

A bit more sophisticated: Put all these asynchronous i/o tasks in a fifo queue, and set up a thread-pool that pops tasks off the queue and polls with GetOverlappedResult and GetLastError. A task that is unfinished goes back into the queue. If a task is complete, the thread that popped it off the queue executes a callback. A thread-pool that operates like this will reduce/prevent the excessive number of context switches in the kernel that multiple threads hammering on Sleep(0) would incur. Then invent a fancy name for this scheme, e.g. call it "I/O Completion Ports".

Then you notice that due to the queue, the latency is proportional to O(n) with n the number of pending i/o tasks in the "I/O Completion Port". To avoid this affecting the latency, you patch your program by setting up multiple "I/O Completion Ports", and reinvent the load balancer to distribute i/o tasks to multiple "ports". With a bit of work, the server will remain responsive and "rather scalable" as long as the server is still i/o bound. The moment the number of i/o tasks makes the server go CPU bound -- which will happen rather soon because of the way IOCPs operate -- the computer overheats and goes up in smoke. And that is when the MBA manager starts to curse Windows as well, and finally agrees to use Linux or *BSD/Apple instead ;-)
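The scheme described above, reduced to pseudo-Python; is_complete() and task.callback are stand-ins for the GetOverlappedResult/GetLastError check and for whatever completion handler the application registered, not real APIs:

    import queue
    import threading
    import time

    pending = queue.Queue()                # FIFO of outstanding overlapped operations

    def pool_worker():
        while True:
            task = pending.get()
            if is_complete(task):          # i.e. GetOverlappedResult succeeded
                task.callback(task.result) # completed: run the application callback
            else:                          # i.e. GetLastError() == ERROR_IO_INCOMPLETE
                pending.put(task)          # not done yet: back into the queue
                time.sleep(0)              # like Sleep(0): give up the time-slice

    for _ in range(4):                     # a small pool of polling threads
        threading.Thread(target=pool_worker, daemon=True).start()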
If the kernel cooperates, no continuous polling should be required.
Indeed. However: My main problem with IOCP is that they provide the "wrong" signal. They tell us when I/O is completed. But then the work is already done, and how did we know when to start? The asynch i/o in select, poll, epoll, kqueue, /dev/poll, etc. do the opposite. They inform us when to start an i/o task, which makes more sense to me at least. Typically, programs that use IOCP must invent their own means of signalling "i/o ready to start", which might kill any advantage of using IOCPs over simpler means (e.g. blocking i/o). This by the way makes me wonder what Windows SUA does? It is OpenBSD based. Does it have kqueue or /dev/poll? If so, there must be support for it in ntdll.dll, and we might use those functions instead of pesky IOCPs. Sturla
On Mon, Nov 5, 2012 at 6:19 AM, Sturla Molden <sturla@molden.no> wrote:
My main problem with IOCP is that they provide the "wrong" signal. They tell us when I/O is completed. But then the work is already done, and how did we know when to start?
The asynch i/o in select, poll, epoll, kqueue, /dev/poll, etc. do the opposite. They inform us when to start an i/o task, which makes more sense to me at least.
Typically, programs that use IOCP must invent their own means of signalling "i/o ready to start", which might kill any advantage of using IOCPs over simpler means (e.g. blocking i/o).
This sounds like you are thoroughly used to the UNIX way and don't appreciate how odd that feels to someone first learning about it (after having used blocking I/O, perhaps in threads, for years). From that perspective, the Windows model is actually easier to grasp than the UNIX model, because it is more similar to the synchronous model: in the synchronous model, you say e.g. "fetch the next 32 bytes"; in the async model you say, "start fetching the next 32 bytes and tell me when you've got them". Whereas in the select()-based model, you have to change your code to say "tell me when I can fetch some more bytes without blocking", and when you are told you have to fetch *some* bytes, but you may not get all 32 bytes, and it is even possible that the signal was an outright lie, so you have to build a loop around this until you actually have gotten 32 bytes. Same if instead of 32 bytes you want the next line -- select() and friends don't tell you whether you can read a whole line, just when at least one more byte is ready.

So it's all a matter of perspective, and there is nothing "wrong" with IOCP. Note, I don't think there is anything wrong with the select() model either -- they're just different but equally valid models of the world, that cause you to structure your code vastly differently. Like wave vs. particle, almost.

-- --Guido van Rossum (python.org/~guido)
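Spelled out for a non-blocking socket, the loop Guido describes might look like the sketch below; wait_readable() is a hypothetical stand-in for however the event loop says "select() reports this fd as ready":

    def read_exactly(sock, n):
        data = b''
        while len(data) < n:
            yield wait_readable(sock)             # park until select() says "ready"
            try:
                chunk = sock.recv(n - len(data))  # may return fewer bytes than asked for
            except BlockingIOError:
                continue                          # the readiness signal was a lie; wait again
            if not chunk:
                raise EOFError("connection closed before %d bytes arrived" % n)
            data += chunk
        return data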
Guido van Rossum wrote:
when you are told you have to fetch *some* bytes but you may not get all 32 bytes ... so you have to build a loop around this until you actually have gotten 32 bytes. Same if instead of 32 bytes you want the next line --
You have to build a loop for these reasons when using synchronous calls, too. You just don't usually notice this because the libraries take care of it for you. -- Greg
Sturla Molden wrote:
Windows certainly has a function to select among multiple wait objects, called WaitForMultipleObjects.
Then the problem is polling for "ready-to-read" and "ready-to-write". The annoying part is that different types of files (disk files, sockets, pipes, named pipes, hardware devices) must be polled with different Windows API calls
I don't follow. Isn't the point of WaitForMultipleObjects that you can make a single call that blocks until any kind of object is ready? -- Greg
Den 3. nov. 2012 kl. 00:54 skrev Greg Ewing <greg.ewing@canterbury.ac.nz>:
I don't follow. Isn't the point of WaitForMultipleObjects that you can make a single call that blocks until any kind of object is ready?
WaitForMultipleObjects will wait for a "wait object" to be signalled – i.e. a thread, process, event, mutex, or semaphore handle. The Unix select() function signals that a file object is ready for read or write. There are different functions to poll file objects for readyness in Windows, depending on their type. That is different from Unix which treats all files the same. When WaitForMultipleObjects is used with overlapped i/o and IOCP, the OVERLAPPED struct has an event object that is signalled on completion (the hEvent member). It is not a wait on the file handle itself. WaitForMultipleObjects cannot wait for a file. Sturla
Working code or it didn't happen. (And it should scale too.) --Guido van Rossum (sent from Android phone) On Nov 2, 2012 2:58 PM, "Sturla Molden" <sturla@molden.no> wrote:
Den 19. okt. 2012 kl. 18:05 skrev Guido van Rossum <guido@python.org>:
An issue in the design of the I/O loop is the strain between a ready-based and completion-based design. The typical Unix design (whether based on select or any of the poll variants) is usually ready-based; but on Windows, the only way to get high performance is to base it on IOCP, which is completion-based (i.e. you start a specific async operation, like writing N bytes, and the I/O loop tells you when it is done). I would like people to be able to write fast event handling programs on Windows too, and ideally the only change would be the implementation of the I/O loop. But I don't know how tenable that is given the dramatically different style used by IOCP and the need to use native Windows API for all async I/O -- it sounds like we could only do this if the library providing the I/O loop implementation also wrapped all I/O operations, and that may be a bit much.
Not really, no.
IOCP might be the easiest way to get high performance on Windows, but certainly not the only.
IOCP is a simple user-space wrapper for a thread-pool and overlapped (i.e. asynchronous) i/o. There is nothing IOCP can do that cannot be done with a pool of threads and non-blocking read or write operations.
Windows certainly has a function to select among multiple wait objects, called WaitForMultipleObjects. If open files are associated with event objects signalling "ready-to-read" or "ready-to-write", that is the basic machinery of an Unix select() function.
Then the problem is polling for "ready-to-read" and "ready-to-write". The annoying part is that different types of files (disk files, sockets, pipes, named pipes, hardware devices) must be polled with different Windows API calls – but there are non-blocking calls to poll them all. For this reason, Cygwin's select function spawns one thread to poll each type of file. Threads are very cheap on Windows, and polling loops can use Sleep(0) to release the remainder of their time-slice, so this kind of polling is not very expensive. However, if we use a thread pool for the polling, instead of spawning new threads on each call to select, we would be doing more or less the same as Windows' built-in IOCPs, except we are signalling "ready" instead of "finished".
Thus, I think it is possible to get high performance without IOCP. But Microsoft has only implemented a select call for sockets. My suggestion would be to forget about IOCP and implement select for more than just sockets on Windows. The reason for this is that select and IOCP signal on different sides of the I/O operation (ready vs. completed), so programs based on select and IOCP tend to have opposite logics with respect to scheduling I/O. And as the general trend today is to develop for Unix and then port to Windows (as most programmers find the Windows API annoying), I think it would be better to port select (and perhaps poll and epoll) to Windows than provide IOCP to Python.
Sturla
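A toy sketch of the emulation described above -- one polling thread per file type, each using that type's non-blocking readiness check and reporting on a shared queue (all names here are hypothetical):

    import queue
    import threading
    import time

    def emulated_select(groups, timeout=None):
        # groups: list of (handles, is_ready) pairs, one per file type,
        # where is_ready(handle) is that type's non-blocking readiness check.
        ready = queue.Queue()
        stop = threading.Event()

        def poll_loop(handles, is_ready):
            while not stop.is_set():
                for h in handles:
                    if is_ready(h):
                        ready.put(h)
                        return
                time.sleep(0)            # give up the rest of the time-slice

        for handles, is_ready in groups:
            threading.Thread(target=poll_loop, args=(handles, is_ready),
                             daemon=True).start()
        try:
            return ready.get(timeout=timeout)   # first handle to become ready
        except queue.Empty:
            return None
        finally:
            stop.set()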
On 02/11/2012 11:59pm, Guido van Rossum wrote:
Working code or it didn't happen. (And it should scale too.)
I have some (mostly) working code which replaces tulip's "pollster" classes with "proactor" classes for select(), poll(), epoll() and IOCP. See

https://bitbucket.org/sbt/tulip-proactor/changeset/c64ff42bf0f2679437838ee77...

The IOCP proactor does not support ssl (or ipv6) so main.py does not succeed in downloading from xkcd.com using ssl. Using the other proactors it works correctly.

The basic interface for the proactor looks like

    class Proactor:
        def recv(self, sock, n): ...
        def send(self, sock, buf): ...
        def connect(self, sock, address): ...
        def accept(self, sock): ...

        def poll(self, timeout=None): ...
        def pollable(self): ...

recv(), send(), connect() and accept() initiate io operations and return futures. poll() returns a list of ready futures. pollable() returns true if there are any outstanding operations registered with the proactor. You use a pattern like

    f = proactor.recv(sock, 100)
    if not f.done():
        yield from scheduling.block_future(f)
    res = f.result()

-- Richard
This is awesome! I have to make time to understand in more detail how it works and what needs to change in the platform-independent API -- I want to get to the point where the *only* thing you change is the pollster/proactor (both kind of lame terms :-). I am guessing that the socket operations (or the factory for the transport class) needs to be made part of the pollster; the Twisted folks are telling me the same thing.

FWIW, I've been studying other event loops. It's interesting to see the similarities (and differences) between e.g. the tulip eventloop, pyftpd's ioloop, Tornado's IOLoop, and 0MQ's IOLoop. The latter two look very similar, except that 0MQ makes the poller pluggable, but generally there are lots of similarities between the structure of all four. Twisted, as usual, stands apart. :-)

--Guido

On Sat, Nov 3, 2012 at 2:20 PM, Richard Oudkerk <shibturn@gmail.com> wrote:
On 02/11/2012 11:59pm, Guido van Rossum wrote:
Working code or it didn't happen. (And it should scale too.)
I have some (mostly) working code which replaces tulip's "pollster" classes with "proactor" classes for select(), poll(), epoll() and IOCP. See
https://bitbucket.org/sbt/tulip-proactor/changeset/c64ff42bf0f2679437838ee77...
The IOCP proactor does not support ssl (or ipv6) so main.py does not succeed in downloading from xkcd.com using ssl. Using the other proactors it works correctly.
The basic interface for the proactor looks like
    class Proactor:
        def recv(self, sock, n): ...
        def send(self, sock, buf): ...
        def connect(self, sock, address): ...
        def accept(self, sock): ...

        def poll(self, timeout=None): ...
        def pollable(self): ...
recv(), send(), connect() and accept() initiate io operations and return futures. poll() returns a list of ready futures. pollable() returns true if there are any outstanding operations registered with the proactor. You use a pattern like
    f = proactor.recv(sock, 100)
    if not f.done():
        yield from scheduling.block_future(f)
    res = f.result()
-- Richard
-- --Guido van Rossum (python.org/~guido)
On Sun, Nov 4, 2012 at 12:06 AM, Guido van Rossum <guido@python.org> wrote:
FWIW, I've been studying other event loops. It's interesting to see the similarities (and differences) between e.g. the tulip eventloop, pyftpd's ioloop, Tornado's IOLoop, and 0MQ's IOLoop. The latter two look very similar, except that 0MQ makes the poller pluggable, but generally there are lots of similarities between the structure of all four. Twisted, as usual, stands apart. :-)
AFAIK, Twisted is the only framework of the ones listed that supports IOCP. This is probably the reason why it's so different. -- Paul
On Sat, Nov 3, 2012 at 3:06 PM, Guido van Rossum <guido@python.org> wrote:
FWIW, I've been studying other event loops. It's interesting to see the similarities (and differences) between e.g. the tulip eventloop, pyftpd's ioloop, Tornado's IOLoop, and 0MQ's IOLoop. The latter two look very similar, except that 0MQ makes the poller pluggable, but generally there are lots of similarities between the structure of all four. Twisted, as usual, stands apart. :-)
Pyzmq's IOLoop is actually a fork/monkey-patch of Tornado's, and they have the same pluggable-poller implementation (In the master branch of Tornado it's been moved to the PollIOLoop subclass). -Ben
--Guido
On Sat, Nov 3, 2012 at 2:20 PM, Richard Oudkerk <shibturn@gmail.com> wrote:
On 02/11/2012 11:59pm, Guido van Rossum wrote:
Working code or it didn't happen. (And it should scale too.)
I have some (mostly) working code which replaces tulip's "pollster" classes with "proactor" classes for select(), poll(), epoll() and IOCP. See
https://bitbucket.org/sbt/tulip-proactor/changeset/c64ff42bf0f2679437838ee77...
The IOCP proactor does not support ssl (or ipv6) so main.py does not succeed in downloading from xkcd.com using ssl. Using the other proactors it works correctly.
The basic interface for the proactor looks like
    class Proactor:
        def recv(self, sock, n): ...
        def send(self, sock, buf): ...
        def connect(self, sock, address): ...
        def accept(self, sock): ...

        def poll(self, timeout=None): ...
        def pollable(self): ...
recv(), send(), connect() and accept() initiate io operations and return futures. poll() returns a list of ready futures. pollable() returns true if there are any outstanding operations registered with the proactor. You use a pattern like
    f = proactor.recv(sock, 100)
    if not f.done():
        yield from scheduling.block_future(f)
    res = f.result()
-- Richard
-- --Guido van Rossum (python.org/~guido)
On Sun, Nov 4, 2012 at 8:00 AM, Ben Darnell <ben@bendarnell.com> wrote:
On Sat, Nov 3, 2012 at 3:06 PM, Guido van Rossum <guido@python.org> wrote:
FWIW, I've been studying other event loops. It's interesting to see the similarities (and differences) between e.g. the tulip eventloop, pyftpd's ioloop, Tornado's IOLoop, and 0MQ's IOLoop. The latter two look very similar, except that 0MQ makes the poller pluggable, but generally there are lots of similarities between the structure of all four. Twisted, as usual, stands apart. :-)
Pyzmq's IOLoop is actually a fork/monkey-patch of Tornado's, and they have the same pluggable-poller implementation (In the master branch of Tornado it's been moved to the PollIOLoop subclass).
I was beginning to suspect as much. :-)

Have you had the time to look at tulip's eventloop? I'd love your feedback: http://code.google.com/p/tulip/source/browse/polling.py

Also, Richard has a modified version that supports IOCP, which changes the APIs around quite a bit. (Does Tornado try anything with IOCP? Does it even support Windows?) Any thoughts on this vs. my version? https://bitbucket.org/sbt/tulip-proactor/changeset/c64ff42bf0f2679437838ee77...

-- --Guido van Rossum (python.org/~guido)
On Sat, 03 Nov 2012 21:20:18 +0000 Richard Oudkerk <shibturn@gmail.com> wrote:
On 02/11/2012 11:59pm, Guido van Rossum wrote:
Working code or it didn't happen. (And it should scale too.)
I have some (mostly) working code which replaces tulip's "pollster" classes with "proactor" classes for select(), poll(), epoll() and IOCP. See
https://bitbucket.org/sbt/tulip-proactor/changeset/c64ff42bf0f2679437838ee77...
The IOCP proactor does not support ssl (or ipv6) so main.py does not succeed in downloading from xkcd.com using ssl. Using the other proactors it works correctly.
It wouldn't be crazy to add an in-memory counterpart to SSLSocket in Python 3.4 (*). It could re-use the same underlying _ssl._SSLSocket, but initialized with a "memory BIO" in OpenSSL jargon. PyOpenSSL already has something similar, which is used in Twisted. (an in-memory SSL object probably only makes sense in non-blocking mode)

(*) patches welcome :-)

Regards

Antoine.
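For illustration, a sketch of how such an in-memory SSL object can be driven once the event loop shuttles the ciphertext; this uses the ssl.MemoryBIO / SSLContext.wrap_bio API that eventually appeared in Python 3.5, so it shows the idea rather than what was available at the time:

    import ssl

    ctx = ssl.create_default_context()
    incoming = ssl.MemoryBIO()    # ciphertext from the network goes in here
    outgoing = ssl.MemoryBIO()    # ciphertext to send is read from here
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname="xkcd.com")

    def handshake_step(received_bytes):
        # Feed whatever ciphertext the event loop received, advance the
        # handshake, and return the ciphertext that must be sent next.
        if received_bytes:
            incoming.write(received_bytes)
        try:
            tls.do_handshake()
        except (ssl.SSLWantReadError, ssl.SSLWantWriteError):
            pass                   # not done yet; keep pumping bytes
        return outgoing.read()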
On 3 November 2012 21:20, Richard Oudkerk <shibturn@gmail.com> wrote:
The IOCP proactor does not support ssl (or ipv6) so main.py does not succeed in downloading from xkcd.com using ssl. Using the other proactors it works correctly.
The basic interface for the proactor looks like
    class Proactor:
        def recv(self, sock, n): ...
        def send(self, sock, buf): ...
        def connect(self, sock, address): ...
        def accept(self, sock): ...

        def poll(self, timeout=None): ...
        def pollable(self): ...
I've just been looking at this, and from what I can see, am I right in thinking that the IOCP support is *only* for sockets? (I'm not very familiar with socket programming, so I had a bit of difficulty following the code). In particular, it can't be used to register non-socket file objects? From my understanding of the IOCP documentation on MSDN, this is fundamental - IOCP can only be used on HANDLE objects that have been opened with the FILE_FLAG_OVERLAPPED flag, which is not used by "normal" Python IO objects like file handles and pipes, so it will never be possible to poll these objects using IOCP. Just trying to make sure I understand the scope of this work... Paul
2013/1/16 Paul Moore <p.f.moore@gmail.com>
On 3 November 2012 21:20, Richard Oudkerk <shibturn@gmail.com> wrote:
The IOCP proactor does not support ssl (or ipv6) so main.py does not succeed in downloading from xkcd.com using ssl. Using the other proactors it works correctly.
The basic interface for the proactor looks like
    class Proactor:
        def recv(self, sock, n): ...
        def send(self, sock, buf): ...
        def connect(self, sock, address): ...
        def accept(self, sock): ...

        def poll(self, timeout=None): ...
        def pollable(self): ...
I've just been looking at this, and from what I can see, am I right in thinking that the IOCP support is *only* for sockets? (I'm not very familiar with socket programming, so I had a bit of difficulty following the code). In particular, it can't be used to register non-socket file objects? From my understanding of the IOCP documentation on MSDN, this is fundamental - IOCP can only be used on HANDLE objects that have been opened with the FILE_FLAG_OVERLAPPED flag, which is not used by "normal" Python IO objects like file handles and pipes, so it will never be possible to poll these objects using IOCP.
It works for disk files as well, but you indeed have to pass FILE_FLAG_OVERLAPPED when opening the file. This is similar to sockets: s.setblocking(False) is required for asynchronous writes to work. -- Amaury Forgeot d'Arc
On 16/01/2013 5:59pm, Paul Moore wrote:
I've just been looking at this, and from what I can see, am I right in thinking that the IOCP support is*only* for sockets? (I'm not very familiar with socket programming, so I had a bit of difficulty following the code). In particular, it can't be used to register non-socket file objects? From my understanding of the IOCP documentation on MSDN, this is fundamental - IOCP can only be used on HANDLE objects that have been opened with the FILE_FLAG_OVERLAPPED flag, which is not used by "normal" Python IO objects like file handles and pipes, so it will never be possible to poll these objects using IOCP.
Only sockets are supported because it uses WSARecv()/WSASend(), but it could very easily be made to use ReadFile()/WriteFile(). Then it would work with overlapped pipes (as currently used by multiprocessing) or other files opened with FILE_FLAG_OVERLAPPED. IOCP cannot be used with normal python file objects. But see http://bugs.python.org/issue12939 -- Richard
On 16 January 2013 18:54, Richard Oudkerk <shibturn@gmail.com> wrote:
Only sockets are supported because it uses WSARecv()/WSASend(), but it could very easily be made to use ReadFile()/WriteFile(). Then it would work with overlapped pipes (as currently used by multiprocessing) or other files openned with FILE_FLAG_OVERLAPPED.
Oh, cool. I hadn't checked the source to see if multiprocessing opened its pipes with FILE_FLAG_OVERLAPPED. Good to know it does. And yes, if normal file objects were opened that way, that would allow those to be used as well. Paul
On Fri, Nov 2, 2012 at 5:29 PM, Sturla Molden <sturla@molden.no> wrote:
Thus, I think it is possible to get high performance without IOCP. But Microsoft has only implemented a select call for sockets. My suggestion would be to forget about IOCP and implement select for more than just sockets on Windows. The reason for this is that select and IOCP signal on different sides of the I/O operation (ready vs. completed), so programs based on select and IOCP tend to have opposite logics with respect to scheduling I/O. And as the general trend today is to develop for Unix and then port to Windows (as most programmers find the Windows API annoying), I think it would be better to port select (and perhaps poll and epoll) to Windows than provide IOCP to Python.
Twisted supports both select()-style loops and IOCP, in a way that is transparent to user code. The key is presenting an async API to users (e.g. Protocol.dataReceived gets called with bytes), rather than e.g. trying to pretend they're talking to a socket-like object you can call recv() on.
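A minimal sketch of that push-style API: a Protocol never calls recv(); the reactor (select-based or IOCP-based) delivers bytes to it, so the same code runs on either.

    from twisted.internet import protocol

    class LinePrinter(protocol.Protocol):
        # The reactor calls dataReceived() with whatever bytes arrived;
        # nothing here depends on how the reactor waited for them.
        def connectionMade(self):
            self.buffer = b""

        def dataReceived(self, data):
            self.buffer += data
            while b"\n" in self.buffer:
                line, self.buffer = self.buffer.split(b"\n", 1)
                print("got line:", line)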
On Sat, Nov 3, 2012 at 8:02 AM, Itamar Turner-Trauring < itamar@futurefoundries.com> wrote:
Twisted supports both select()-style loops and IOCP, in a way that is transparent to user code. The key is presenting an async API to users (e.g. Protocol.dataReceived gets called with bytes), rather than e.g. trying to pretend they're talking to a socket-like object you can call recv() on.
Although, if you're using a yield-based API (or coroutines), you can have a recv()/read()-style API with IOCP as well.
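A sketch of what that can look like on top of Richard's proactor interface quoted earlier (scheduling.block_future() is from his example; read_exactly() here is hypothetical):

    def read_exactly(proactor, sock, n):
        # Completion-style recv() wrapped in a coroutine: each recv() returns
        # a future for "up to n bytes", and we keep issuing operations until
        # the full n bytes have been delivered.
        data = b""
        while len(data) < n:
            f = proactor.recv(sock, n - len(data))
            if not f.done():
                yield from scheduling.block_future(f)
            chunk = f.result()
            if not chunk:
                raise ConnectionError("connection closed before %d bytes" % n)
            data += chunk
        return data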
participants (15)
- Amaury Forgeot d'Arc
- Antoine Pitrou
- Ben Darnell
- Glyph
- Greg Ewing
- Guido van Rossum
- Itamar Turner-Trauring
- Jasper St. Pierre
- Kevin LaTona
- Nick Coghlan
- Paul Colomiets
- Paul Moore
- Richard Oudkerk
- Sam Rushing
- Sturla Molden