[Python-ideas] Membership of infinite iterators

Nick Coghlan ncoghlan at gmail.com
Wed Nov 1 21:39:03 EDT 2017


On 2 November 2017 at 04:29, Koos Zevenhoven <k7hoven at gmail.com> wrote:

> SPOILER ALERT! At the moment, Nick's statement is in fact **always** true
> in **all** cases (at least when ignoring atypical cases and some
> inaccuracies in phrasing). Another question is whether the statement
> **should** be true at all.
>
> PyErr_CheckSignals(), the function that checks for pending signals, now
> **implicitly** uses the strictest possible memory-order requirement
> (SEQ_CST) for checking the `is_tripped` flag, a value which can be used
> to peek at whether there are any pending signals. This means that two
> threads that call PyErr_CheckSignals() can't even **check** the value of
> that flag at the same time; they have to wait for each other and for
> whatever the CPU hardware does for cache synchronization.
>

Nice, that's deeper than I went - I just assumed there was an
interlocked_increment in there somewhere, without checking whether or not
there were any optimised code paths that managed to avoid that call :)


> From a correctness point of view, that is absolutely great: if
> PyErr_CheckSignals() is called, it is guaranteed to notice a new signal,
> no matter how few picoseconds have passed since the `is_tripped` flag
> was set. But is that really important? Do we really need things to be
> slowed down by this? And do we want people to "optimize" code by
> avoiding signal checking?
>

It isn't signal handling latency that's the concern here, it's potentially
missing signals. Consider the case where we have 3 threads: A, B, C. A is
the main thread that will actually handle signals, B and C both happened to
receive them. We'll also assume the two threads receive *different*
signals (otherwise they'll potentially get collapsed into one notification
regardless).

The scenario we want to avoid is having one or both of the signals set, but
is_tripped cleared. With an interlocked check (where both 0->1 and 1->0 are
memory barriers), that clearly can't happen. I suspect you're right that
this could also be achieved with a less strict memory sync requirement on
the 0->1 check, but that's much harder to reason about than simply limiting
the number of iterations we make through an iterator consumption loop
before checking for signals.
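
The race being avoided here can be modelled in miniature with C11 atomics. This is a simplified sketch, not CPython's actual signalmodule.c code; the names only echo the real ones. The key point is the sequentially consistent exchange that clears the summary flag *before* scanning the per-signal slots:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified model (illustrative only): per-signal "tripped" slots
 * plus a summary is_tripped flag, as in the scenario described above. */

#define NSIG_MODEL 8

static atomic_bool tripped[NSIG_MODEL];
static atomic_bool is_tripped;

/* Called from a signal-receiving context (threads B and C). */
static void trip_signal_model(int sig)
{
    atomic_store(&tripped[sig], true);   /* set the slot first */
    atomic_store(&is_tripped, true);     /* then the summary flag */
}

/* Called from the main thread (A). Because the summary flag is cleared
 * with a sequentially consistent exchange before the scan, any
 * trip_signal_model() racing with us either happened before the
 * exchange (so the scan below sees its slot) or after it (so
 * is_tripped is set again and a later check will see it). Either way,
 * no signal is silently dropped with one or both slots set but
 * is_tripped cleared. */
static int check_signals_model(void)
{
    int handled = 0;
    if (!atomic_exchange(&is_tripped, false))
        return 0;                        /* fast path: nothing pending */
    for (int sig = 0; sig < NSIG_MODEL; sig++) {
        if (atomic_exchange(&tripped[sig], false))
            handled++;
    }
    return handled;
}
```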


> The signals won't be caught until checked anyway, and as we've seen, one
> solution is to use a counter to determine if enough time has passed that
> another check for potential signals should happen. That does, however,
> raise the question of who should own the counter, and whether it's OK
> for multiple threads to share one. If so, would we still be able to
> avoid slow atomic decrements (or increments)?
>
> But another solution might be to make a less strict but almost equivalent
> functionality with much less overhead. Something like this in a header file:
>
> static inline int PyErr_PROBE_SIGNALS(void) {
>     static int volatile *flag = (int volatile *) &is_tripped;
>     if (*flag) {
>         return PyErr_CheckSignals();
>     }
>     else {
>         return 0;
>     }
> }
>
> Here, the flag is cast to volatile to force a fresh read from
> memory/cache on each check. However, to make it more standard and
> portable across compilers/platforms, it might be better to use an
> atomic load with "MEMORY_ORDER_RELAXED" instead of dereferencing
> *flag. Another thing is that `is_tripped`, which is currently declared
> static inside signalmodule.c [4], would somehow have to be exposed in
> the API to make it available to the inline function in headers.
>
> This solution might be fast enough for a lot of cases, although still
> slightly slower than the counter approach, at least if the counter
> approach completely avoids per-iteration memory access by having each
> function own the counter used for this.
>
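
For what it's worth, the relaxed-load variant of that probe could be sketched like this. It's hypothetical: `is_tripped_exposed` and `check_signals_full` are stand-ins I've made up, since the real `is_tripped` is static inside signalmodule.c and the real slow path is PyErr_CheckSignals():

```c
#include <stdatomic.h>

/* Hypothetical export of the flag; in CPython today is_tripped is
 * static inside Modules/signalmodule.c, so this is an assumption. */
static atomic_int is_tripped_exposed;

/* Stand-in for the real, expensive PyErr_CheckSignals(): here it just
 * clears the flag and reports that the slow path ran. */
static int check_signals_full(void)
{
    atomic_store(&is_tripped_exposed, 0);
    return 1;
}

/* The probe: a relaxed load imposes no synchronisation cost on the
 * common "no signal pending" path; only a set flag triggers the full,
 * sequentially consistent check. */
static inline int probe_signals(void)
{
    if (atomic_load_explicit(&is_tripped_exposed, memory_order_relaxed))
        return check_signals_full();
    return 0;
}
```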

One thing I like about a nested inner loop is that it's easy to understand
the rationale for it *without* knowing any of the details of how memory
synchronisation works, as it's just:

- we want to check for signals regularly so the latency in interrupt
handling isn't too high
- PyErr_CheckSignals() is too expensive to call on every iteration
- so we only check every N iterations
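
That nested-loop shape might look something like this (a sketch; `consume`, `check_fn`, and `CHECK_INTERVAL` are illustrative names, with `check_fn` standing in for PyErr_CheckSignals()):

```c
/* Consume an iterator-like stream of n_items, but only pay for the
 * (possibly expensive) signal check once every CHECK_INTERVAL items.
 * check_fn returns nonzero if a signal demands that we bail out. */

#define CHECK_INTERVAL 1000

static long consume(long n_items, int (*check_fn)(void), long *checks_made)
{
    long consumed = 0;
    while (consumed < n_items) {
        /* inner loop: no signal checking, no memory synchronisation */
        long burst = n_items - consumed;
        if (burst > CHECK_INTERVAL)
            burst = CHECK_INTERVAL;
        for (long i = 0; i < burst; i++)
            consumed++;              /* stand-in for real per-item work */
        /* outer loop: one check per burst bounds interrupt latency */
        (*checks_made)++;
        if (check_fn())
            break;
    }
    return consumed;
}

/* Stand-in check: no signal ever pending. */
static int no_signal(void)
{
    return 0;
}
```

The reasoning above needs no memory-model knowledge at all: the per-item cost is pure loop overhead, and the check frequency is an explicit, tunable constant.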

Memory synchronisation then only comes up if someone asks why
"PyErr_CheckSignals" is expensive to call. And while it wouldn't surprise
me at all if you're right and there are ways to make that call cheaper,
they're still never going to be cheaper than explicitly reducing the
frequency with which it is called.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia
