Tulip / PEP 3156 event loop implementation question: CPU vs. I/O starvation

Here's an interesting puzzle. Check out the core of Tulip's event loop: http://code.google.com/p/tulip/source/browse/tulip/unix_events.py#672

Specifically, it does something like this:

1. poll for I/O, appending any ready handlers to the _ready queue
2. append any handlers scheduled for a time <= now to the _ready queue
3. while _ready: handler = _ready.popleft(); call handler

It is the latter loop that causes me some concern. In theory it is possible for a bad callback to make this loop never finish, as follows:

    def hogger():
        tulip.get_event_loop().call_soon(hogger)

Because call_soon() appends the handler to the _ready queue, the while loop will never finish. There is a simple enough solution (Tornado uses this AFAIK):

    now_ready = list(_ready)
    _ready.clear()
    for handler in now_ready:
        call handler

However, this implies that we go back to the I/O polling code more frequently. While the I/O polling code sets the timeout to zero when there's anything in the _ready queue, so it won't block, it still isn't free; it's an expensive system call that we'd like to put off until we have nothing better to do. I can imagine various patterns where handlers append other handlers to the _ready queue for immediate execution, and I'd like to make such patterns efficient (i.e. the user shouldn't have to worry about the cost of the I/O poll compared to the amount of work appended to the _ready queue).

It is also convenient to say that a hogger that really wants to hog the CPU can do so anyway, e.g.:

    def hogger():
        while True:
            pass

However, this would pretty much assume malice; real-life versions of the former hogger pattern may be spread across many callbacks and could be hard to recognize or anticipate.

So what's more important? Avoid I/O starvation at all cost, or make the callbacks-posting-callbacks pattern efficient? I can see several outcomes of this discussion: we could end up deciding that one or the other strategy is always best; we could also leave it up to the implementation (but then I still would want guidance for what to do in Tulip); we could even decide this is so important that the user needs to be able to control the policy here (though I hate having many configuration options, since in practice few people bother to take control, and you might as well have hard-coded the default...).

Thoughts? Do I need to explain it better?

-- --Guido van Rossum (python.org/~guido)
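[Editorial note: for concreteness, here is a minimal runnable sketch of the two strategies being compared. The _ready name follows the Tulip source; run_drain_all and run_snapshot are made-up names for illustration.]

    import collections

    _ready = collections.deque()

    def run_drain_all():
        # Strategy 1 (current Tulip): keep running until the queue is empty.
        # A handler that re-adds itself with call_soon() keeps this loop
        # spinning forever, starving the I/O poll.
        while _ready:
            handler = _ready.popleft()
            handler()

    def run_snapshot():
        # Strategy 2 (Tornado-style): run only the handlers that were ready
        # when the pass started; anything added meanwhile waits for the next
        # pass, so the I/O poll always gets a turn.
        now_ready = list(_ready)
        _ready.clear()
        for handler in now_ready:
            handler()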

12.01.2013 00:41, Guido van Rossum wrote:
Maybe it could be, at least for the standard Tulip implementation, parameterizable with a simple integer value -- the suggested max number of loop iterations? E.g. something like the following:

    # `suggested_iter_limit` is the parameter
    actual_limit = max(len(_ready), suggested_iter_limit)
    for i in range(actual_limit):
        if not _ready:
            break
        handler = _ready.popleft()
        call handler ...

Regards. *j

Jan Kaliszewski, 12.01.2013 02:28:
Yep, it could simply use itertools.islice() when iterating over _ready with an appropriate upper bound factor relative to the actual length, and then cut down the list after the loop. So it would never go, say, 50% over the initially anticipated workload. Or rather a fixed number, I guess, to make it more predictable for users. That would be a user-configurable parameter to the I/O loop.

    actual_limit = len(_ready) + max_additional_load_per_loop
    for handler in itertools.islice(_ready, None, actual_limit):
        call handler ...
    del _ready[:actual_limit]

Stefan
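[Editorial note: one caveat if this is applied to Tulip as-is. Tulip's _ready is a collections.deque, which supports neither slice deletion (del _ready[:actual_limit] raises TypeError) nor being appended to while iterated (a handler calling call_soon() inside the islice loop would raise RuntimeError). A runnable version of the same bounded-drain idea, with max_additional_load_per_loop as the hypothetical knob, could pop explicitly instead:]

    import collections

    _ready = collections.deque()
    max_additional_load_per_loop = 100  # hypothetical user-configurable knob

    def run_bounded():
        # Run the initially ready handlers plus at most a bounded amount of
        # work added by call_soon() during this pass, then return to the poll.
        actual_limit = len(_ready) + max_additional_load_per_loop
        for _ in range(actual_limit):
            if not _ready:
                break
            handler = _ready.popleft()
            handler()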

On Sat, Jan 12, 2013 at 4:39 PM, Stefan Behnel <stefan_ml@behnel.de> wrote:
But do we need that in the reference loop? It seems like additional complexity when it has yet to be demonstrated that the simple solution of alternating processing of call_soon registrations with IO callbacks is inadequate. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, 12 Jan 2013 19:53:27 +1000 Nick Coghlan <ncoghlan@gmail.com> wrote:
Why do you talk about "reference loop"? It should be usable in production, not some kind of demonstration system that people will have to replace with a third-party library to get decent results. Regards Antoine.

On Sat, Jan 12, 2013 at 8:38 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I mean "reference loop" in the same sense that CPython is the "reference interpreter". You can get a lot more cool stuff by upgrading to, e.g. IPython, as your interactive interpreter, or using an interactive debugger other than pdb, but that doesn't mean all of those enhancements should be folded back into the core. In this case, we have a feature where there is a reasonable default behaviour (i.e. alternating between processing ready calls and checking for triggering of IO callbacks as Guido suggested), and no compelling evidence to justify a more complex solution. Ergo, the reference loop should use the simple approach, until such evidence is provided. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, Jan 12, 2013 at 9:41 AM, Guido van Rossum <guido@python.org> wrote:
Given the availability of "yield from" as a tool for efficiently invoking other asynchronous operations without hitting the event loop at all, it seems to me that it is more appropriate to avoid IO starvation by interleaving IO event processing and ready callback processing. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, Jan 12, 2013 at 1:08 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Oops, I meant to include a link to http://bugs.python.org/issue7946, which is about the convoy effect created by the GIL implementation when I/O bound threads are processed in the presence of a CPU bound thread (essentially, the I/O latency increases to the GIL check interval). (The thing that changed in 3.2 is that the magnitude of the convoy effect is now independent of the work-per-bytecode in the CPU bound thread) That's what makes me think always alternating between processing ready callbacks and checking for IO events is the right thing to do. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On 12 January 2013 12:41, Guido van Rossum <guido@python.org> wrote:
How expensive is it really? If it's select, it's terrible, but we shouldn't be using that anywhere. If it's poll(), it is moderately expensive, and it still doesn't scale - it's linear with the number of fds. If it's IO completion ports on Windows, it is approximately free - the OS calls back into us every time we tell it we're ready for more events. And if it's epoll it is also basically free, reading off of an event queue rather than checking every entry in the array. kqueue has similar efficiency, for BSD systems. I'd want to see some actual numbers before assuming that the call into epoll or completion is actually a driving factor in latency here.

-Rob
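[Editorial note: for anyone who wants those numbers, a rough micro-benchmark is easy to write. This Linux-only sketch times zero-timeout epoll polls on an idle fd set; real figures would depend on the kernel and on how many fds are registered.]

    import select
    import time

    ep = select.epoll()  # Linux-only; kqueue/IOCP equivalents would differ
    N = 100000

    start = time.perf_counter()
    for _ in range(N):
        ep.poll(0)       # zero-timeout poll; returns immediately
    elapsed = time.perf_counter() - start

    print("%.2f us per epoll.poll(0) call" % (elapsed / N * 1e6))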

On Sat, Jan 12, 2013 at 1:41 AM, Guido van Rossum <guido@python.org> wrote:
[...]
def hogger():
I read your statements as:

* I don't want the user to cause IO starvation
* I want the user to cause IO starvation

Which means you have two options:

* Make an opinionated decision that won't be perfect for everyone (not as bad as it sounds)
* Allow configurability

IMO core event loops need this configurability but not on a daily basis, e.g. Windows XP's event loop gave priority to the foreground process (i.e. UI events) and Windows Server 2003 gave priority to background processes. E.g. (warning: unoptimized pseudocode follows):

    while True:
        for i in range(io_weight):
            pop_io()
        for i in range(event_weight):
            pop_ready()
    # note one of the weights can be zero
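[Editorial note: fleshed out, that weighted alternation could look like the runnable sketch below. Both queues, the weights, and the termination condition are illustrative; Yuval's pseudocode loops forever.]

    import collections

    _io_events = collections.deque()   # filled by the poller (illustrative)
    _ready = collections.deque()       # filled by call_soon()

    def run_weighted(io_weight, event_weight):
        # Interleave the two queues according to user-chosen weights;
        # setting either weight to zero disables that side entirely.
        while _io_events or _ready:
            for _ in range(io_weight):
                if not _io_events:
                    break
                _io_events.popleft()()
            for _ in range(event_weight):
                if not _ready:
                    break
                _ready.popleft()()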

Hi, On Sat, Jan 12, 2013 at 12:41 AM, Guido van Rossum <guido@python.org> wrote:
Adding a poll-able descriptor to the loop will eval it in the next iteration of the loop, so why make a difference with timers? Define call_soon to be called in the next iteration - not in the same. Basically every modification of the event loop should be evaluated in the next iteration, not the same.

Regards, Markus

On Sat, Jan 12, 2013 at 8:01 AM, Markus <nepenthesdev@gmail.com> wrote:
We're looking for a "fair" scheduling algorithm here, and I think this describes it. Everything else should be an optimization from here. For example, if the event loop "detects" that it is CPU-bound, perhaps it skips some fraction of the relatively expensive calls to the IO check. I have no idea how to do such "detection" efficiently. Maybe just count the number of consecutive IO checks with timeout=0 that returned no actionable IO, and skip that number of checks:

    run_ready_queue()
    check_io(timeout=0) -> no IO, counter becomes 1
    run_ready_queue()   (skip 1)
    run_ready_queue()
    check_io(timeout=0) -> no IO, counter becomes 2
    run_ready_queue()   (skip 1)
    run_ready_queue()   (skip 2)
    run_ready_queue()
    check_io(timeout=0)
    ...

There would be some limits, of course, and this should probably be based on time, not cycles. It's not clear what to do when IO *does* occur -- divide the counter by 2? At any rate, this is an O(1) change to the event loop that would get some interesting adaptive behavior, while still maintaining its fairness. IMHO the PEP should leave this unspecified, perhaps suggesting only that event loops have clear documentation regarding their fairness. Then users can select event loops based on their needs.

Dustin
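[Editorial note: a sketch of that adaptive counter, with run_ready_queue() and check_io() stubbed out, and with the divide-by-2 rule from Dustin's open question picked arbitrarily as one possible answer.]

    def run_ready_queue():
        pass             # stub: run whatever is in the ready queue

    def check_io(timeout):
        return False     # stub: poll; return True if any I/O was actionable

    skip_budget = 0      # how many I/O checks to skip before the next real one
    skip_left = 0

    def one_pass():
        global skip_budget, skip_left
        run_ready_queue()
        if skip_left > 0:
            skip_left -= 1            # skip this I/O check entirely
            return
        if check_io(timeout=0):
            skip_budget //= 2         # I/O showed up: back off the skipping
        else:
            skip_budget += 1          # another empty poll: skip one more next time
        skip_left = skip_budget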

On Fri, Jan 11, 2013 at 6:41 PM, Guido van Rossum <guido@python.org> wrote:
This is actually a useful pattern, not just a pathological "bad callback". If the function does some work before re-adding itself, it allows for better multitasking, kind of like doing the work in another thread (with a starvation-free event loop). If the event loop starves IO in this case, it's difficult to get this kind of non-blocking background execution (you have to use call_later with a non-zero timeout, slowing the work down unnecessarily).
In isolation, yes. Under real-world load, it's less clear. A zero-timeout poll that has nothing to return is in some sense wasted work, but if there's other stuff going on then we may just change the timing of the poll calls rather than inserting additional ones.
I'm not sure it's worth the complexity to offer both, so I'd be inclined to just have the starvation-free version. -Ben
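[Editorial note: the self-rescheduling pattern Ben describes might look like this under a starvation-free loop. process_item, the chunk size, and the variable names are all made up for illustration.]

    def process_item(item):
        pass                          # stand-in for the real per-item work

    def background_task(items, loop):
        # Do a bounded chunk of work, then reschedule ourselves so that
        # I/O events and other callbacks get a turn between chunks.
        for _ in range(100):          # chunk size chosen arbitrarily
            if not items:
                return                # all done
            process_item(items.pop())
        loop.call_soon(background_task, items, loop)

    # Kick it off with, e.g.:
    #   loop = tulip.get_event_loop()
    #   loop.call_soon(background_task, work_items, loop)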

Thanks all! It is clear what to do now. Run all those handlers that are currently ready but not those added during this run. -- --Guido van Rossum (python.org/~guido)
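[Editorial note: in loop form, a minimal sketch of that decision. Snapshotting the queue length avoids copying the deque; this mirrors what the resolution implies rather than quoting actual Tulip code.]

    import collections

    _ready = collections.deque()

    def run_once_ready():
        # Snapshot the queue length first: handlers that call_soon() more
        # work merely extend the queue, and those new entries are left for
        # the next pass, after the next I/O poll.
        ntodo = len(_ready)
        for _ in range(ntodo):
            handler = _ready.popleft()
            handler()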

participants (10):

- Antoine Pitrou
- Ben Darnell
- Dustin J. Mitchell
- Guido van Rossum
- Jan Kaliszewski
- Markus
- Nick Coghlan
- Robert Collins
- Stefan Behnel
- Yuval Greenfield