Let me add something to the discussion we had last month about code that cannot be interrupted with Ctrl-C, with the Jupyter Notebook interrupt command, etc. I wasn't completely sure whether anyone cared much about being able to interrupt long-running code, or about things like atomic operations, but maybe I'll just write this one down anyway while it's still in my head. As a reminder of what we discussed: for KeyboardInterrupt to happen, C code needs to manually call PyErr_CheckSignals() regularly to handle signals, but doing that in tight loops might add too much overhead. So where does the overhead come from, and what could perhaps be done about it? Read below.

On Thu, Oct 19, 2017 at 10:54 AM, Koos Zevenhoven <k7hoven@gmail.com> wrote:
On Thu, Oct 19, 2017 at 3:42 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 19 October 2017 at 08:34, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Nick Coghlan wrote:
since breaking up the current single-level loops into nested loops would be a prerequisite for allowing these APIs to check for signals while they're running, while keeping the per-iteration overhead low
Is there really much overhead? Isn't it just checking a flag?
It's checking an atomically updated flag, so it forces CPU cache synchronisation, which means you don't want to be doing it on every iteration of a low level loop.
Even the mere fact that it's a C function call makes me reluctant to recommend doing it in a lot of tight loops. Who knows what the function does anyway, let alone what it might or might not do in the future.
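[To make the trade-off concrete: the usual way a C extension keeps a tight loop interruptible without paying for a check on every pass is to amortize the call with a cheap counter test. Below is a minimal sketch of that pattern; the module and function names are mine for illustration, not anything from CPython.]

```c
#include <Python.h>

/* busy_sum(n): a deliberately tight loop that stays interruptible
   by polling for signals once every 4096 iterations rather than on
   every pass through the loop. */
static PyObject *
busy_sum(PyObject *self, PyObject *arg)
{
    long n = PyLong_AsLong(arg);
    if (n == -1 && PyErr_Occurred())
        return NULL;

    long total = 0;
    for (long i = 0; i < n; i++) {
        total += i;
        /* (i & 0xFFF) == 0 is true once every 4096 iterations, so
           the cost of PyErr_CheckSignals() is amortized away. */
        if ((i & 0xFFF) == 0 && PyErr_CheckSignals() < 0)
            return NULL;  /* a handler raised, e.g. KeyboardInterrupt */
    }
    return PyLong_FromLong(total);
}

static PyMethodDef busy_methods[] = {
    {"busy_sum", busy_sum, METH_O, "Sum 0..n-1 while staying interruptible."},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef busy_module = {
    PyModuleDef_HEAD_INIT, "busy", NULL, -1, busy_methods
};

PyMODINIT_FUNC
PyInit_busy(void)
{
    return PyModule_Create(&busy_module);
}
```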
While what Nick states above about CPU cache synchronization doesn't sound quite **right**, it might take a core developer a whole day just to roughly figure out whether the statement is even **true**. In particular, it is not obvious whether it is **occasionally** true, **always** true in **some** cases, or *sometimes* true in *all* cases. The difficulty comes partly from the fact that pyatomic.h [1] is "arranged" in layers of branched C preprocessor directives [2], and partly from the fact that the whole of pyatomic.h is "deep magic", as a comment at the beginning of the file confirms. Some of the complexity has accumulated after that comment was already there.

SPOILER ALERT! At the moment, Nick's statement is in fact **always** true in **all** cases (at least if we ignore atypical setups and some inaccuracies in phrasing). Another question is whether the statement **should** be true at all.

PyErr_CheckSignals(), the function that checks for pending signals, currently uses the strictest possible memory-order requirement (SEQ_CST), albeit **implicitly**, for checking the `is_tripped` flag, a flag that can be used to peek at whether any signals are pending. This means that two threads calling PyErr_CheckSignals() can't even **check** the value of that flag at the same time; they have to wait for each other and for whatever the CPU hardware does for cache synchronization. From a correctness point of view, that is absolutely great: whenever PyErr_CheckSignals() is called, it is guaranteed to notice a new signal, regardless of how few picoseconds earlier the `is_tripped` flag was set. But is that really important? Do we really need things to be slowed down by this? And do we want people to "optimize" their code by avoiding signal checking?

The signals won't be caught until checked anyway, and as we've seen, one solution is to use a counter to determine whether enough time has passed that another check for pending signals should happen. That does, however, raise the questions of who should own the counter and whether it's OK for multiple threads to share one. If they may share it, would we then be able to avoid slow atomic decrements (or increments)?

Another solution might be to provide a less strict but almost equivalent check with much less overhead. Something like this in a header file:

```c
/* Cheap probe: pay for the full PyErr_CheckSignals() call only when
   the flag suggests a signal may actually be pending.  (This assumes
   is_tripped has been exposed outside signalmodule.c; see below.) */
static inline int
PyErr_PROBE_SIGNALS(void)
{
    static volatile int *flag = (volatile int *) &is_tripped;
    if (*flag) {
        return PyErr_CheckSignals();
    }
    else {
        return 0;
    }
}
```

Here, the flag is cast to volatile to force a read from memory/cache on each call. However, to make this more standard and portable across compilers and platforms, it might be better to replace the `*flag` read with an atomic load using "MEMORY_ORDER_RELAXED". Another thing is that `is_tripped`, which is currently declared static inside signalmodule.c [4], would somehow have to be exposed in the API to make it available to the inline function in header files.

This solution might be fast enough for a lot of cases, although still slightly slower than the counter approach, at least if the counter approach could avoid per-iteration memory access entirely by requiring each function to own the counter it uses.
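[For illustration, here is what the two kinds of loads look like in plain C11 atomics; CPython's pyatomic.h wraps roughly this machinery behind compiler- and platform-specific branches. This standalone sketch uses my own names, not CPython's.]

```c
#include <stdatomic.h>
#include <stdio.h>

/* Stand-in for the signal module's is_tripped flag. */
static atomic_int tripped;

int main(void)
{
    /* What the current check effectively does: a sequentially
       consistent load, the strictest memory ordering. */
    if (atomic_load(&tripped))  /* defaults to memory_order_seq_cst */
        puts("signal pending (seq_cst check)");

    /* The cheaper probe suggested above: a relaxed load still reads
       a coherent value; it just doesn't order other memory accesses
       around it, which is fine for "peek first, then do the real,
       strictly ordered check". */
    if (atomic_load_explicit(&tripped, memory_order_relaxed))
        puts("signal pending (relaxed peek)");

    return 0;
}
```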
So you read the whole email? Cool!

––Koos

[1] https://github.com/python/cpython/blob/master/Include/pyatomic.h
[2] C preprocessor code is kind of like commented-out Python code, except that not even the Spanish Inquisition [3] expects you to indent the blocks that they define.
[3] https://www.youtube.com/watch?v=Nf_Y4MbUCLY
[4] https://github.com/python/cpython/blob/master/Modules/signalmodule.c
However, reviewing Serhiy's PR reminded me that PyErr_CheckSignals() already encapsulates the "Should this thread even be checking for signals in the first place?" logic, which means the code change to make the itertools iterators inherently interruptible with Ctrl-C is much smaller than I thought it would be.
And if it didn't encapsulate that, you would probably have written a wrapper that does. Good thing it's the wrapper that's exposed in the API.
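[For readers who haven't looked at signalmodule.c: the guard being discussed is roughly of the following shape. This is a hedged sketch with stubbed-out, hypothetical helpers, not the actual CPython code.]

```c
#include <stdbool.h>

/* Hypothetical stand-ins for signalmodule.c internals. */
static volatile int is_tripped;
static bool in_main_thread(void) { return true; }
static int run_signal_handlers(void) { return 0; }

/* The kind of logic the exposed wrapper encapsulates: bail out
   cheaply unless this thread should actually run handlers. */
static int
check_signals_sketch(void)
{
    if (!is_tripped)
        return 0;                  /* fast path: nothing pending */
    if (!in_main_thread())
        return 0;                  /* only the main thread runs
                                      Python-level signal handlers */
    return run_signal_handlers();  /* may set e.g. KeyboardInterrupt
                                      and return -1 */
}

int main(void)
{
    return check_signals_sketch();
}
```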
That approach is also clearly safe from an exception handling point of view, since all consumer loops already need to cope with the fact that itr.__next__() may raise arbitrary exceptions (including KeyboardInterrupt).
So that change alone already offers a notable improvement, and combining it with a __length_hint__() implementation that keeps container constructors from even starting to iterate would go even further towards making the infinite iterators more user friendly.
Similar signal checking changes to the consumer loops would also be possible, but I don't think that's an either/or decision: changing the iterators means they'll be interruptible for any consumer, while changing the consumers would make them interruptible for any iterator, and having checks in both the producer & the consumer merely means that you'll be checking for signals twice every 65k iterations, rather than once.
Indeed it's not strictly an either/or decision, but more about where we might spend time executing C code. But I'm leaning a bit towards doing it on the consumer side, because there it's more obvious that the code might take some time to run.
If the consumer ends up iterating over pure-Python objects, there are no concerns about the overhead. But if it *does* call a C-implemented __next__, then that's the case where we actually need the whole thing. Adding the check in both places would double the (small) overhead. And nested (wrapped) iterators are also a thing.
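[For contrast, here is a hedged sketch of the producer-side variant discussed above: an iterator whose tp_iternext (the C slot behind __next__) owns its own counter and polls once per 65536 items, which is where the "65k iterations" figure comes from. The type and field names are illustrative, not CPython's itertools code, and the type-object boilerplate is omitted.]

```c
#include <Python.h>

/* An infinite counting iterator (illustrative, not itertools.count). */
typedef struct {
    PyObject_HEAD
    long value;            /* next value to yield */
    unsigned short poll;   /* wraps to 0 once every 65536 calls */
} counter_sketch;

static PyObject *
counter_sketch_next(counter_sketch *it)
{
    /* ++it->poll wraps from 65535 back to 0, so this branch is taken
       once per 65536 iterations; the iterator object owns the counter,
       so no shared or atomic state is touched on the other passes. */
    if (++it->poll == 0) {
        if (PyErr_CheckSignals() < 0)
            return NULL;   /* e.g. KeyboardInterrupt is now set */
    }
    return PyLong_FromLong(it->value++);
}
```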
––Koos
--
+ Koos Zevenhoven
+ http://twitter.com/k7hoven
+