Let me add something to the discussion we had last month about code that cannot be interrupted with Ctrl-C, with the Jupyter Notebook interrupt command, etc. I wasn't completely sure whether anyone cared much about being able to interrupt long-running code, or about things like atomic operations, but maybe I'll just write this one down anyway while it's still in my head. As a reminder of what we discussed: for KeyboardInterrupt to happen, C code needs to manually call PyErr_CheckSignals() regularly to handle signals, but doing that in tight loops might add too much overhead. So where does the overhead come from, and what could perhaps be done about it? Read below.

On Thu, Oct 19, 2017 at 10:54 AM, Koos Zevenhoven <k7hoven@gmail.com> wrote:
On Thu, Oct 19, 2017 at 3:42 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
On 19 October 2017 at 08:34, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:
Nick Coghlan wrote:
since breaking up the current single-level loops into nested loops would be a prerequisite for allowing these APIs to check for signals while they're running, while keeping the per-iteration overhead low
Is there really much overhead? Isn't it just checking a flag?
It's checking an atomically updated flag, so it forces CPU cache synchronisation, which means you don't want to be doing it on every iteration of a low level loop.
Even the mere fact that it's a C function call makes me reluctant to recommend doing it in a lot of tight loops. Who knows what the function does anyway, let alone what it might or might not do in the future.
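[To make the trade-off concrete: the usual way a C extension keeps a tight loop interruptible without paying for a check on every pass is to amortize the call with a cheap counter test. Below is a minimal sketch of that pattern; the module and function names are mine for illustration, not anything from CPython.]

```c
#include <Python.h>

/* busy_sum(n): a deliberately tight loop that stays interruptible
   by polling for signals once every 4096 iterations rather than on
   every pass through the loop. */
static PyObject *
busy_sum(PyObject *self, PyObject *arg)
{
    long n = PyLong_AsLong(arg);
    if (n == -1 && PyErr_Occurred())
        return NULL;

    long total = 0;
    for (long i = 0; i < n; i++) {
        total += i;
        /* (i & 0xFFF) == 0 is true once every 4096 iterations, so
           the cost of PyErr_CheckSignals() is amortized away. */
        if ((i & 0xFFF) == 0 && PyErr_CheckSignals() < 0)
            return NULL;  /* a handler raised, e.g. KeyboardInterrupt */
    }
    return PyLong_FromLong(total);
}

static PyMethodDef busy_methods[] = {
    {"busy_sum", busy_sum, METH_O, "Sum 0..n-1 while staying interruptible."},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef busy_module = {
    PyModuleDef_HEAD_INIT, "busy", NULL, -1, busy_methods
};

PyMODINIT_FUNC
PyInit_busy(void)
{
    return PyModule_Create(&busy_module);
}
```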
While what Nick states above about CPU cache synchronization doesn't sound quite **right**, it might take a core developer a whole day just to roughly figure out whether the statement is even **true**. In particular, it is not obvious whether it is **occasionally** true, **always** true in **some** cases, or *sometimes* true in *all* cases. The difficulty comes partly from the fact that pyatomic.h [1] is "arranged" in layers of branched C preprocessor directives [2], and partly from the fact that the whole of pyatomic.h is "deep magic", as a comment at the beginning of the file confirms. Some of the complexity has accumulated after that comment was already there.

SPOILER ALERT! At the moment, Nick's statement is in fact **always** true in **all** cases (at least if we ignore atypical setups and some inaccuracies in phrasing). Another question is whether the statement **should** be true at all.

PyErr_CheckSignals(), the function that checks for pending signals, currently uses the strictest possible memory-order requirement (SEQ_CST), albeit **implicitly**, for checking the `is_tripped` flag, a flag that can be used to peek at whether any signals are pending. This means that two threads calling PyErr_CheckSignals() can't even **check** the value of that flag at the same time; they have to wait for each other and for whatever the CPU hardware does for cache synchronization. From a correctness point of view, that is absolutely great: whenever PyErr_CheckSignals() is called, it is guaranteed to notice a new signal, regardless of how few picoseconds earlier the `is_tripped` flag was set. But is that really important? Do we really need things to be slowed down by this? And do we want people to "optimize" their code by avoiding signal checking?

The signals won't be caught until checked anyway, and as we've seen, one solution is to use a counter to determine whether enough time has passed that another check for pending signals should happen. That does, however, raise the questions of who should own the counter and whether it's OK for multiple threads to share one. If they may share it, would we then be able to avoid slow atomic decrements (or increments)?

Another solution might be to provide a less strict but almost equivalent check with much less overhead. Something like this in a header file:

```c
/* Cheap probe: pay for the full PyErr_CheckSignals() call only when
   the flag suggests a signal may actually be pending.  (This assumes
   is_tripped has been exposed outside signalmodule.c; see below.) */
static inline int
PyErr_PROBE_SIGNALS(void)
{
    static volatile int *flag = (volatile int *) &is_tripped;
    if (*flag) {
        return PyErr_CheckSignals();
    }
    else {
        return 0;
    }
}
```

Here, the flag is cast to volatile to force a read from memory/cache on each call. However, to make this more standard and portable across compilers and platforms, it might be better to replace the `*flag` read with an atomic load using "MEMORY_ORDER_RELAXED". Another thing is that `is_tripped`, which is currently declared static inside signalmodule.c [4], would somehow have to be exposed in the API to make it available to the inline function in header files.

This solution might be fast enough for a lot of cases, although still slightly slower than the counter approach, at least if the counter approach could avoid per-iteration memory access entirely by requiring each function to own the counter it uses.
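[For illustration, here is what the two kinds of loads look like in plain C11 atomics; CPython's pyatomic.h wraps roughly this machinery behind compiler- and platform-specific branches. This standalone sketch uses my own names, not CPython's.]

```c
#include <stdatomic.h>
#include <stdio.h>

/* Stand-in for the signal module's is_tripped flag. */
static atomic_int tripped;

int main(void)
{
    /* What the current check effectively does: a sequentially
       consistent load, the strictest memory ordering. */
    if (atomic_load(&tripped))  /* defaults to memory_order_seq_cst */
        puts("signal pending (seq_cst check)");

    /* The cheaper probe suggested above: a relaxed load still reads
       a coherent value; it just doesn't order other memory accesses
       around it, which is fine for "peek first, then do the real,
       strictly ordered check". */
    if (atomic_load_explicit(&tripped, memory_order_relaxed))
        puts("signal pending (relaxed peek)");

    return 0;
}
```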
So you read the whole email? Cool!

––Koos

[1] https://github.com/python/cpython/blob/master/Include/pyatomic.h
[2] C preprocessor code is kind of like commented-out Python code, except that not even the Spanish Inquisition [3] expects you to indent the blocks that they define.
[3] https://www.youtube.com/watch?v=Nf_Y4MbUCLY
[4] https://github.com/python/cpython/blob/master/Modules/signalmodule.c
However, reviewing Serhiy's PR reminded me that PyErr_CheckSignals() already encapsulates the "Should this thread even be checking for signals in the first place?" logic, which means the code change to make the itertools iterators inherently interruptible with Ctrl-C is much smaller than I thought it would be.
And if it didn't encapsulate that, you would probably have written a wrapper that does. Good thing it's the wrapper that's exposed in the API.
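[For readers who haven't looked at signalmodule.c: the guard being discussed is roughly of the following shape. This is a hedged sketch with stubbed-out, hypothetical helpers, not the actual CPython code.]

```c
#include <stdbool.h>

/* Hypothetical stand-ins for signalmodule.c internals. */
static volatile int is_tripped;
static bool in_main_thread(void) { return true; }
static int run_signal_handlers(void) { return 0; }

/* The kind of logic the exposed wrapper encapsulates: bail out
   cheaply unless this thread should actually run handlers. */
static int
check_signals_sketch(void)
{
    if (!is_tripped)
        return 0;                  /* fast path: nothing pending */
    if (!in_main_thread())
        return 0;                  /* only the main thread runs
                                      Python-level signal handlers */
    return run_signal_handlers();  /* may set e.g. KeyboardInterrupt
                                      and return -1 */
}

int main(void)
{
    return check_signals_sketch();
}
```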
That approach is also clearly safe from an exception handling point of view, since all consumer loops already need to cope with the fact that itr.__next__() may raise arbitrary exceptions (including KeyboardInterrupt).
So that change alone already offers a notable improvement, and combining it with a __length_hint__() implementation that keeps container constructors from even starting to iterate would go even further towards making the infinite iterators more user friendly.
Similar signal checking changes to the consumer loops would also be possible, but I don't think that's an either/or decision: changing the iterators means they'll be interruptible for any consumer, while changing the consumers would make them interruptible for any iterator, and having checks in both the producer & the consumer merely means that you'll be checking for signals twice every 65k iterations, rather than once.
Indeed it's not strictly an either/or decision, but more about where we might spend time executing C code. But I'm leaning a bit towards doing it on the consumer side, because there it's more obvious that the code might take some time to run.
If the consumer ends up iterating over pure-Python objects, there are no concerns about the overhead. But if it *does* call a C-implemented __next__, then that's the case where we actually need the whole thing. Adding the check in both places would double the (small) overhead. And nested (wrapped) iterators are also a thing.
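[For contrast, here is a hedged sketch of the producer-side variant discussed above: an iterator whose tp_iternext (the C slot behind __next__) owns its own counter and polls once per 65536 items, which is where the "65k iterations" figure comes from. The type and field names are illustrative, not CPython's itertools code, and the type-object boilerplate is omitted.]

```c
#include <Python.h>

/* An infinite counting iterator (illustrative, not itertools.count). */
typedef struct {
    PyObject_HEAD
    long value;            /* next value to yield */
    unsigned short poll;   /* wraps to 0 once every 65536 calls */
} counter_sketch;

static PyObject *
counter_sketch_next(counter_sketch *it)
{
    /* ++it->poll wraps from 65535 back to 0, so this branch is taken
       once per 65536 iterations; the iterator object owns the counter,
       so no shared or atomic state is touched on the other passes. */
    if (++it->poll == 0) {
        if (PyErr_CheckSignals() < 0)
            return NULL;   /* e.g. KeyboardInterrupt is now set */
    }
    return PyLong_FromLong(it->value++);
}
```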
––Koos
--
+ Koos Zevenhoven
+ http://twitter.com/k7hoven
+