[Python-Dev] Status of thread cancellation

Nick Maclaren nmm1 at cus.cam.ac.uk
Mon Mar 19 15:37:49 CET 2007


Grrk.  I have done this myself, and been involved in one of the VERY
few commercial projects that attempted to do it properly (IBM CEL,
the other recent one being VMS).  I am afraid that there are a lot
of misapprehensions here.

Several people have said things like:

> The thing to model this on, I think, would be the
> BSD sigmask mechanism, which lets you selectively
> block certain signals to create a critical section
> of code. A context manager could be used to make
> its use easier and less error-prone (i.e. harder
> to block async exceptions and then forget to unblock
> them).

No, no, no!  That is an TRULY horrible!  It works fairly well for
things like device drivers, which are both structurally simple and
with no higher level recovery mechanism, so that a failure turning
into a hard hang is not catastrophic.  But it is precisely what you
DON'T want for complex applications, especially when a thread may
need to call an external service 'non-interruptibly'.

Think of updating a complex object in a multi-file database, for
example.  Interrupting half-way through leaves the database in a
mess, but blocking interrupts while (possibly remote) file updates
complete is asking for a hang.  You also see it in horrible GUI
(including raw mode text) programs that won't accept interrupts
until you have completed the action they think you have started.
One of the major advantages of networked systems is that you can
usually log in remotely and kill -9 the damn process!



The way that I, IBM and DEC approached it was by the classic
callback mechanism, with a carefully designed way of promoting
unhandled exceptions/interrupts.  For example, the following is
roughly what I did (somewhat extended, as I didn't do all of this
for all exceptions):

An event set a defined flag, which could be tested (and cleared) by
the thread.  If a second, similar event arrived (or it was not
handled after a suitable time), the event was escalated.

If so, a handler was called that HAD to return (again within a
specific time).  If a second, similar event arrived or it didn't
return by a suitable time, the event was escalated.

If so, another handler was called that COULDN'T return.  If another
event arrived, it returned, or it failed to close down the thread,
the event was escalated.

If so, the thread's built-in environment was closed down without
giving the thread a chance to intervene.  If that failed, the event
was escalated.

If so, the thread was frozen and process termination started.  If
clean termination failed, the event was escalated.

If so, the run-time system produced a dump and killed itself.



You can implement a BSD-style ignore by having an initial handler
that just clears the flag and returns, but a third interrupt before
it does so will force close-down.  There was also a facility to
escalate an exception at the point of generation, which could be
useful for forcible closedown.

There are a zillion variations of the above, but all mainframe
experience is that callbacks are the only sane way to approach the
problem IN APPLICATIONS.  In kernel code, that is not so, which is
why so many of the computer scientists design BSD-style handling
(i.e. they think of kernel programming rather than very complex
application programming).



> Unconditionally killing a whole process is no big
> problem because all the resources it's using get
> cleaned up by the OS, and the effect on other
> processes is minimal and well-defined (pipes and
> sockets get EOF, etc.). But killing a thread can
> leave the rest of the program in an awkward state.

I wish that were so :-(

Sockets, terminals etc. are stateful devices, and killing a process
can leave them in a very unclean state.  It is one of the most
common causes of unkillable processes (the process can't go until
its files do, and the socket is jammed).  Many people can witness
the horrible effects of ptys being left in 'echo off' or worse
states, the X focus being left in a stuck override redirect window
and so on.

But you also have the multi-file database problem, which also applies
to shared memory segments.  Even if the process dies cleanly, it may
be part of an application whose state is global across many processes.
One common example is adding or deleting a user, where an unclean
kill can leave the system in a very weird state.


Regards,
Nick Maclaren,
University of Cambridge Computing Service,
New Museums Site, Pembroke Street, Cambridge CB2 3QH, England.
Email:  nmm1 at cam.ac.uk
Tel.:  +44 1223 334761    Fax:  +44 1223 334679


More information about the Python-Dev mailing list