On 2 November 2017 at 04:29, Koos Zevenhoven <k7hoven@gmail.com> wrote:

SPOILER ALERT! At the moment, Nick's statement is in fact *always* true in *all* cases (at least when ignoring untypical cases and some inaccuracies in phrasing). Another question is whether the statement *should* be true at all.

PyErr_CheckSignals(), the function that checks for pending signals, now *implicitly* uses the strictest possible memory-order requirement (SEQ_CST) for checking the `is_tripped` flag, a value which can be used to peek whether there are any pending signals. This means that two threads that call PyErr_CheckSignals can't even *check* the value of that flag at the same time, and they'll have to wait for each other and whatever the CPU hardware implementation does for cache synchronzation.

Nice, that's deeper than I went - I just assumed there was an interlocked_increment in there somewhere, without checking whether or not there were any optimised code paths that managed to avoid that call :)

From a correctness point of view, that is absolutely great: if PyErr_CheckSignals() is called, it is guaranteed to notice a new signal regardles of how small the number of picoseconds after the `is_tripped` flag has been set. But is that really important? Do we really need things to be slowed down by this? And do we want people to "optimize" code by avoiding signal checking?

It isn't signal handling latency that's the concern here, it's potentially missing signals. Consider the case where we have 3 threads: A, B, C. A is the main thread that will actually handle signals, B and C both happened to receive them. We'll also assume the two threads receive *different* signals (otherwise they'll potentially get collapsed into one notification regardless).

The scenario we want to avoid is having one or both of the signals set, but is_tripped cleared. With an interlocked check (where both 0->1 and 1->0 are memory barriers), that clearly can't happen. I suspect you're right that this could also be achieved with a less strict memory sync requirement on the 0->1 check, but that's much harder to reason about than simply limiting the number of iterations we make through an iterator consumption loop before checking for signals.

The signals won't be caught until checked anyway, and as we've seen, one solution is to use a counter to determine if enough time has passed that another check for potential signals should happen. That does, however, raise the question of who should own the counter, and if it's OK for multiple threads to use the same counter. If yes, then would we be able to avoid slow atomic decrements (or increments)?

But another solution might be to make a less strict but almost equivalent functionality with much less overhead. Something like this in a header file:

static inline int PyErr_PROBE_SIGNALS() {
static int volatile *flag = (int volatile *) &is_tripped;
if (*flag) {
return PyErr_CheckSignals();
}
else {
return 0;
}
}

Here, the flag is casted to volatile to force a check to happen each time from memory/cache. However, to make it more standard and to make sure it works with all compilers/platforms, it might be better to, instead of *flag, use an atomic load with "MEMORY_ORDER_RELAXED". Another thing is that `is_tripped`, which is currently declared as static inside signalmodule.c [4], should somehow be exposed in the API to make it available for the inline function in headers.

This solution might be fast enough for a lot of cases, although still slightly slower than the counter approach, at least if the counter approach would completely avoid per-iteration memory access by requiring each function to own the counter that is used for this.

One thing I like about a nested inner loop is that it's easy to understand the rationale for it *without* knowing any of the details of how memory synchronisation works, as it's just:

- we want to check for signals regularly so the latency in interrupt handling isn't too high

- PyErr_CheckSignals() is too expensive to call on every iteration

- so we only check every N iterations

Memory synchronisation then only comes up if someone asks why "PyErr_CheckSignals" is expensive to call. And while it wouldn't surprise at all if you're right and there are ways to make that call cheaper, they're still never going to be cheaper than explicitly reducing the frequency with which it is called.

Cheers,

Nick.

Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia