Allow signal suppression

Hey everyone,
I've been developing code that (alas) needs to operate in a runtime environment that is quite *enthusiastic* about sending SIGTERMs and the like, and that has short critical sections which, if interrupted, are very hard to resume without some user-visible anomaly. This means getting to know the signal handling logic far too well.
In particular, preventing signals in a "dangerous window" is very difficult in the current language. You can change the signal handlers to "suppressing" handlers and then restore them from the main thread, but if you have potentially critical regions running in any non-main thread, there's no good way for them to tell the main thread to change the handlers... except by sending the main thread a signal. That requires the "suppressing" handler to have nontrivial logic in it, *but* Python signal handlers are reentrant: signals are not suppressed *during a signal handler*, and so hilarity ensues.
Digging through all of this, though, there seems to be one interesting thing that could be done in CPython in particular, and so I have a possibly-crazy proposal for a runtime-specific extension.
*Proposal: Add (as an optional part of the spec, which runtimes may choose to implement if they wish) sys.suppress_signals(bool).* Calling this function with a value of True would defer all signal handling until it was called again with a value of False.
The reason this is potentially insane is that suppressing signals does all sorts of things: e.g., it prevents keyboard interrupts, blocks EPIPE errors if you try to write to a pipe with a terminated peer, stops child processes from reporting their exit to the parent, screws with profiling timers, and so on. This would be a proper footgun if misused, the sort of thing you should only activate if you actually understand POSIX signals in depth.
The implementation in CPython would be surprisingly simple: store the boolean value in an atomic int, and add a second check at the start of _PyErr_CheckSignalsTstate <https://github.com/python/cpython/blob/master/Modules/signalmodule.c#L1693>, prior to clearing is_tripped.
How insane does this idea sound to people?
Yonatan
-- Yonatan Zunger Distinguished Engineer and Chief Ethics Officer He / Him zunger@humu.com 100 View St, Suite 101 Mountain View, CA 94041 Humu.com <https://www.humu.com> · LinkedIn <https://www.linkedin.com/company/humuhq> · Twitter <https://twitter.com/humuinc>
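
For readers who want to see the existing workaround the message describes (swap in "suppressing" handlers, then restore them), here is a minimal sketch. The name `defer_signals` is hypothetical and not part of any library. It assumes Python 3.8+ (for `signal.raise_signal`), and it only works when entered from the main thread, because `signal.signal` refuses to run anywhere else. It also makes no attempt to address the reentrancy concern raised above.

```python
import signal
from contextlib import contextmanager


@contextmanager
def defer_signals(signums):
    """Main-thread only: record the given signals during the block and
    re-deliver them to the original handlers once the block exits."""
    pending = []

    def _record(signum, frame):
        # Just remember the signal; the real handler runs after the block.
        pending.append(signum)

    # signal.signal() returns the previously installed handler.
    previous = {s: signal.signal(s, _record) for s in signums}
    try:
        yield pending
    finally:
        for s, handler in previous.items():
            # None means the old handler was not installed from Python,
            # so it cannot be restored through signal.signal().
            if handler is not None:
                signal.signal(s, handler)
        for s in list(pending):
            # Re-raise so the restored handler (or default action) applies.
            signal.raise_signal(s)
```

A critical region would then be wrapped as `with defer_signals([signal.SIGTERM]): ...`. Note that if the restored disposition is the default one, re-raising SIGTERM terminates the process as usual once the block has finished.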

Absolutely, but I figured the natural thing to expose from the C API was a very minimal function, and then put a context manager in the Python layer. The actual context manager implementation I would use would be a bit smarter than a bare set/reset -- it would use an unbounded semaphore and only unsuppress when the count dropped to zero. That way, multiple threads can simultaneously have critical regions, and suppression would happen whenever anyone was suppressing.
On Thu, Jun 25, 2020 at 1:54 PM MRAB <python@mrabarnett.plus.com> wrote:
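
The counting context manager described in this message could look roughly like the sketch below. Since `sys.suppress_signals` is only a proposal and does not exist, the sketch takes a stand-in callable, `set_suppressed`, as a parameter. The class and method names are hypothetical, and a plain counter guarded by a lock is used here in place of the unbounded semaphore mentioned above.

```python
import threading
from contextlib import contextmanager


class SignalSuppressor:
    """Keep suppression on while at least one critical region is active."""

    def __init__(self, set_suppressed):
        # set_suppressed stands in for the proposed sys.suppress_signals(bool).
        self._set_suppressed = set_suppressed
        self._lock = threading.Lock()
        self._count = 0

    @contextmanager
    def critical_section(self):
        with self._lock:
            self._count += 1
            if self._count == 1:
                # First region entered anywhere: turn suppression on.
                self._set_suppressed(True)
        try:
            yield
        finally:
            with self._lock:
                self._count -= 1
                if self._count == 0:
                    # Last region left: turn suppression off again.
                    self._set_suppressed(False)
```

With this shape, any number of threads can enter `critical_section()` concurrently, and the underlying switch is flipped only on the first entry and the last exit.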

On Thu, Jun 25, 2020 at 5:09 PM Yonatan Zunger via Python-ideas < python-ideas@python.org> wrote:
I find, for reasons you have already mentioned, having a "suppress all signals" something _really_ strange in userland code. But maybe I just have never seen a case in which it makes sense. Are you sure that the problem isn't "a runtime environment which is quite enthusiastic about sending SIGTERMs"?

Oh, that's definitely part of the problem, but that is *far* beyond my ability to fix. Right now I'm still working on getting its owners to do things like "could you please log somewhere when you kill jobs, and maybe even indicate why the job was killed?".
The main time that signals show up in life outside of writing device drivers and the like is when implementing or interacting with runtime environments, basically, as they're the mechanism of interruptive interprocess communication, noncooperative scheduling control, and so on.
Typical horrifying example: One of the systems I do control is the shell that runs cron jobs (not their scheduling, but their actual execution). It needs to provide an outer harness that manages getting commands from the scheduler, integration with all sorts of logging and monitoring systems, etc., and needs to actually execute the Python code of the real jobs inside it. It needs to do various kinds of noncooperative scheduling to those subtasks (timeouts, killing and replacing workers under various circumstances, etc.) and so runs them in a subprocess. So I get several layers of signals: incoming ones from the SIGTERM-happy outer runtime environment (GCP), ones from the outer runner harness to the inner jobs, and the logic in the inner jobs.
And alas, the logic in some of the inner jobs has to make fundamentally non-idempotent, state-changing requests over APIs to 3P systems that I don't control, and which if terminated leave the 3P system in an indeterminate and undeterminable state. Which means that if the cron job gets terminated in the middle of that API request, the system ends up in an unknown state, and whatever you do to get it into a known state will be wrong (leading to user-visible bad behavior) half the time. And because its final state can't be determined from its own API, and it can't be invoked idempotently, you can't even use a 2-phase commit approach to protect that.
But it turns out that signal suppression does actually make this problem go away enough to be manageable in prod. Except that the code now has to be changed from single-threaded to multi-threaded for various other reasons, and so signal suppression by changing the signal handlers and then changing them back no longer works. So that's an example of why you might find yourself in such a situation in userland.
And overall, Python's signal handling mechanism is pretty good; it's *way* nicer than having to deal with it in C, since signal handlers run in the main thread as more-or-less ordinary Python code, and you don't have to deal with the equivalent of signal-safety and the like. The downside of that flexibility, though, is that some tasks like deferring signals end up being *really hard* in the Python layer, because even appending the signum to an array isn't atomic enough to guarantee that it won't be interrupted by another signal.
On Thu, Jun 25, 2020 at 5:43 PM Bernardo Sulzbach <bernardo@bernardosulzbach.com> wrote:
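
The reason the handler-swapping approach stops working once the code is multi-threaded can be shown in a few lines: `signal.signal` may only be called from the main thread, and raises `ValueError` otherwise. The helper name `install_from_worker` is hypothetical and exists only for this illustration.

```python
import signal
import threading


def install_from_worker():
    """Try to change a handler from a non-main thread and return the
    errors that attempt raised."""
    errors = []

    def worker():
        try:
            # Rejected by CPython: handlers can only be set in the main thread.
            signal.signal(signal.SIGTERM, signal.SIG_IGN)
        except ValueError as exc:
            errors.append(exc)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return errors
```

So a critical region running in a worker thread cannot install a "suppressing" handler for itself; it has to ask the main thread to do so.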

On Thu, 25 Jun 2020 18:32:48 -0700 Yonatan Zunger via Python-ideas <python-ideas@python.org> wrote:
If you want to serialize execution of signal handlers, it seems like the best course of action would be to use signal.set_wakeup_fd() and process incoming signals in an event loop. Other than that, SimpleQueue may help as well, since it's specifically meant to be reentrant: https://docs.python.org/3/library/queue.html#simplequeue-objects
Regards
Antoine.
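
A minimal sketch of the first suggestion, assuming a POSIX system and the main thread. The helper names `make_signal_channel` and `drain` are hypothetical. `signal.set_wakeup_fd` makes the interpreter write each received signal number as a single byte to a non-blocking file descriptor, so the signals can be read back in arrival order instead of being acted on inside a handler.

```python
import signal
import socket


def make_signal_channel(signums):
    """Route the given signals to a socket. Must be called from the main thread."""
    reader, writer = socket.socketpair()
    reader.setblocking(False)
    writer.setblocking(False)  # set_wakeup_fd requires a non-blocking fd

    # A Python-level handler must still be installed, otherwise the signal's
    # default action (such as terminating the process) would still apply.
    for signum in signums:
        signal.signal(signum, lambda received, frame: None)

    old_fd = signal.set_wakeup_fd(writer.fileno())
    return reader, writer, old_fd


def drain(reader):
    """Return the signal numbers received since the last call, oldest first."""
    try:
        data = reader.recv(4096)
    except BlockingIOError:
        return []  # nothing pending
    return list(data)  # one byte per signal
```

In a real program the `reader` socket would be registered with `select` or `selectors` alongside other event sources, and the loop would decide when it is safe to act on a pending SIGTERM, for example only between critical regions.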

participants (4): Antoine Pitrou, Bernardo Sulzbach, MRAB, Yonatan Zunger