[Python-Dev] Problem with signals in a single threaded application

Sat Jan 27 11:12:39 CET 2007

I apologise for going off-topic, but this is an explanation of why
I said that signal handling is not reliable.  The only relevance to
Python is that Python should avoid relying on signals if possible,
and try to be a little defensive if not.  Signals will USUALLY do
what is expected, but not always :-(

Anything further by Email, please.

Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> 
> > This one looks like an oversight in Python code, and so is a bug,
> > but it is important to note that signals do NOT work reliably under
> > any Unix or Microsoft system.
> 
> That's a rather pessimistic way of putting it. In my
> experience, signals in Unix mostly do what they're
> meant to do quite reliably -- it's just a matter of
> understanding what they're meant to do.

Yes, it is pessimistic, but I am afraid that my experience is that
it is so :-(  That doesn't deny your point that they MOSTLY do
'work', but car drivers MOSTLY don't need to wear seat belts, either.
I am talking about high-RAS objectives, and ones where very rare
failure modes can become common (e.g. HPC and other specialist uses).

More commonly, there are plain bugs in the implementations which are
sanctioned by the standards (Linux is relatively disdainful of such
legalistic games).  Because the standards say that everything is undefined
behaviour, many vendors' support mechanisms will refuse to accept
bug reports unless you push like hell.  And, as some are DIABOLICALLY
difficult to explain, let alone demonstrate, they can remain lurking
for years or decades.

> There may be bugs in certain systems that cause
> signals to get lost under obscure circumstances, but
> that's no reason for Python to make the situation
> worse by introducing bugs of its own.

100% agreed.

> > Two related signals received between two 'checkpoints' (i.e. when
> > the signal is tested and cleared).  You may only get one of them,
> > and 'related' does not mean 'the same'.
> 
> I wasn't aware that this could happen between
> different signals. If it can, there must be some
> rationale as to why the second signal is considered
> redundant. Otherwise there's a bug in either the
> design or the implementation.

Nope.  There is often a clash between POSIX and the hardware, or
a case where a 'superior' signal overrides an 'inferior' one.
I have seen SIGKILL flush some other signals, for example.  And, on
some systems, SIGFPE may be divided into the basic hardware exceptions.
If you catch SIGFPE as such, all of those may be cleared.  I don't
think that many (any?) current systems do that.

And it is actually specified to occur for the SIGSTOP, SIGTSTP,
SIGTTIN, SIGTTOU, SIGCONT group.
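
The simplest form of this (two instances of the SAME signal arriving
between checkpoints) is easy to show from Python.  What follows is a
minimal sketch, assuming a POSIX system, Python 3.3 or later, and a
single-threaded process; the function name is made up for illustration.
It blocks SIGUSR1, sends it to itself several times, unblocks it, and
counts how often the handler runs.  Ordinary (non-realtime) signals are
not queued, so the handler should run once however many were sent.

```python
import os
import signal


def count_coalesced_deliveries(sends=2):
    """Send SIGUSR1 to this process `sends` times while it is blocked,
    then unblock it and return how many times the handler ran.

    Hypothetical helper for illustration; call it from the main thread
    of a single-threaded process.
    """
    calls = []

    def handler(signum, frame):
        calls.append(signum)

    old_handler = signal.signal(signal.SIGUSR1, handler)
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    try:
        # While blocked, the signal is only marked pending; a second
        # send finds it already pending and adds nothing.
        for _ in range(sends):
            os.kill(os.getpid(), signal.SIGUSR1)
    finally:
        # Unblocking delivers the single pending instance, and the
        # Python-level handler runs before this call returns.
        signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})
    signal.signal(signal.SIGUSR1, old_handler)
    return len(calls)


if __name__ == "__main__":
    print(count_coalesced_deliveries(2))
```

A handler therefore cannot be used to count events; it can only be told
that at least one has occurred since the last checkpoint.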

> > A second signal received while the first is being 'handled' by the
> > operating system or language run-time system.
> 
> That one sounds odd to me. I would expect a signal
> received during the execution of a handler to be
> flagged and cause the handler to be called again
> after it returns. But then I'm used to the BSD
> signal model, which is relatively sane.

It has nothing to do with the BSD model (which may be saner, but still
isn't 100% reliable); the loss occurs at a lower layer.  At the VERY lowest
level, when a genuine hardware event causes an interrupt, the FLIH
(first-level interrupt handler) runs in God mode (EVERYTHING disabled)
until it classifies what is going on.  This is a ubiquitous misdesign
of modern hardware, but that is off-topic.  Hardware 'signals' from
other CPUs/devices may well get lost if they occur in that window.

And there are other, but less extreme, causes at higher levels in the
operating system.  Unix and Microsoft do NOT have a reliable signal
delivery model, where the sender of a signal checks if the recipient
has got it and retries if not.  Some operating systems do - but I don't
think that BSD does.
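
An application that cannot tolerate a lost signal can imitate such a
model itself, at a much higher level.  The following is only a sketch
under stated assumptions, not how any of those operating systems do it:
the recipient acknowledges each signal by writing a byte to a pipe, and
the sender retries until it sees the acknowledgement.  The names
install_ack_handler and send_with_ack, the ACK byte, and the retry and
timeout values are all invented for illustration.  It assumes a POSIX
system and that the sender holds the read end of the pipe.

```python
import os
import select
import signal


def install_ack_handler(signum, ack_write_fd):
    """Recipient side: acknowledge every delivery of `signum` by
    writing one byte to `ack_write_fd` (hypothetical helper)."""
    def handler(received_signum, frame):
        # Real work would go here; it must tolerate being run more
        # than once, because a retry can cause a duplicate delivery.
        os.write(ack_write_fd, b"\x06")

    signal.signal(signum, handler)


def send_with_ack(pid, signum, ack_read_fd, retries=5, timeout=0.2):
    """Sender side: send `signum` to `pid` and wait for the ack byte,
    retrying up to `retries` times.  Returns the attempt number that
    was acknowledged, or raises TimeoutError (hypothetical helper)."""
    for attempt in range(1, retries + 1):
        os.kill(pid, signum)
        ready, _, _ = select.select([ack_read_fd], [], [], timeout)
        if ready:
            os.read(ack_read_fd, 1)  # consume the acknowledgement
            return attempt
    raise TimeoutError("no acknowledgement after %d attempts" % retries)


if __name__ == "__main__":
    # Toy usage: the process signals itself, so it plays both roles.
    read_fd, write_fd = os.pipe()
    install_ack_handler(signal.SIGUSR1, write_fd)
    print(send_with_ack(os.getpid(), signal.SIGUSR1, read_fd))
```

The price is that a signal may now be handled twice when only the
acknowledgement is late, so the recipient's action has to be idempotent.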

> > A signal sent while the operating system is doing certain things to
> > the application (including, sometimes, when it is swapped out or
> > deep in I/O.)
> 
> That sounds like an outright bug. I can't think
> of any earthly reason why the handler shouldn't
> be called eventually, if it remains installed and
> the process lives long enough.

See above.  It gets lost at a low level.  That is why you can cause
serious time drift on an "IBM PC" (most modern ones) by hammering
the video card or generating streams of floating-point fixups.  Most
people don't notice, because xntp or equivalent fixes it up.

And there are worse problems.  I could start on cross-CPU TLB and ECC
handling on large shared memory systems.  I managed to get an Origin
in a state where it wouldn't even power down from the power-off button,
and I had to flip breakers, due to THAT one!  I have reason to believe
that all largish SMP systems have similar problems.

Again, it is possible to design an operating system to avoid those
issues, but we are talking about mainstream ones, and they don't.

Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  nmm1 at cam.ac.uk
Tel.:  +44 1223 334761    Fax:  +44 1223 334679