[ python-Bugs-756924 ] SIGSEGV causes hung threads (Linux)

Tue May 11 18:46:41 EDT 2004

Bugs item #756924, was opened at 2003-06-18 19:28
Message generated for change (Comment added) made by langmead
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=756924&group_id=5470

Category: Threads
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Greg Jones (morngnstar)
Assigned to: Nobody/Anonymous (nobody)
Summary: SIGSEGV causes hung threads (Linux)

Initial Comment:
When a segmentation fault happens on Linux in any 
thread but the main thread, the program exits, but 
zombie threads remain behind.

Steps to reproduce:
1. Download attached tar and extract files zombie.py 
and zombieCmodule.c.
2. Compile and link zombieCmodule.c as a shared library 
(or whatever other method you prefer for making a 
Python extension module).
3. Put the output from step 2 (zombieC.so) in your 
lib/python directory.
4. Run python2.2 zombie.py.
5. After the program exits, run ps.

zombie.py launches several threads that just loop 
forever, and one that calls a C function in zombieC. The 
latter prints "NULL!" then segfaults intentionally, 
printing "Segmentation fault". Then the program exits, 
returning control back to the shell.

Expected, and Python 2.1 behavior:
No Python threads appear in the output of ps.

Actual Python 2.2 behavior:
5 Python threads appear in the output of ps. To kill 
them, you have to apply kill -9 to each one individually.

  Not only does this bug leave around messy zombie 
threads, but the threads left behind hold on to program 
resources. For example, if the program binds a socket, 
that port cannot be bound again until someone kills the 
threads. Of course programs should not generate 
segfaults, but if they do they should fail gracefully.

  I have identified the cause of this bug. The old Python 
2.1 behavior can be restored by removing these lines of 
Python/thread_pthread.h:

sigfillset(&newmask);
SET_THREAD_SIGMASK(SIG_BLOCK, &newmask, 
&oldmask);

... and ...

SET_THREAD_SIGMASK(SIG_SETMASK, &oldmask, NULL);

  I guess even SIGSEGV gets blocked by this code, and 
somehow that prevents the default behavior of segfaults 
from working correctly.

  I'm not suggesting that removing this code is a good 
way to fix this bug. This is just an example to show that 
it seems to be the blocking of signals that causes this 
bug.

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-11 18:46

Message:
Logged In: YES 
user_id=119306

I was handwaving a bit over the "arrangements" to make with the 
siglongjmp. It is probable that blocking SIGINT in all spawned 
threads will be the easiest. It will also work in both the pthreads and 
LWP case (signal sent to one unblocked thread in the process) and the 
LinuxThreads and SGI threads case (signal broadcast to the process 
group, which includes each thread individually.)  The only thing I wanted 
to double check was whether readline could be executed by any thread 
other than the main thread. If so, the SIGINT handler needs to check not 
whether it is the main thread, but rather if it is the (or *a*?)  thread that 
currently is in the middle of a readline call.

----------------------------------------------------------------------

Comment By: Jason Lowe (jasonlowe)
Date: 2004-05-11 15:57

Message:
Logged In: YES 
user_id=56897

Ack.  I thought I was logged in for that previous comment
which was from me (jasonlowe).

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-05-11 15:52

Message:
Logged In: NO 

I agree, the original patch I submitted is horribly
ham-fisted because it blocks all signals.  I'm kicking
myself for not foreseeing the problems with SIGSEGV, SIGTERM,
etc.  as reported in 756924.

The original problem I was trying to fix was that the wrong
thread (i.e.: any thread but the main thread) would receive
the SIGINT and end up doing the longjmp() to the context
saved by the main thread.  Then we have two threads
executing on the main thread's stack which is a Bad Thing. 
With the way the readline support currently handles SIGINT
via setjmp()/longjmp(), you really want the main thread and
only the main thread to get the SIGINT and perform that
longjmp().

Would it be reasonable to block only SIGINT (and not other
signals) for all threads but the main thread?  That would
force SIGINT to be handled by the main thread and eliminate
the worry that the wrong thread will do the longjmp() into
the main thread's context in the readline code.

I agree with large parts of langmead's proposed approach to
fixing this, but I do have concerns about the combination of
these two parts:

* Remove the section of thread creation that blocks signals.

* Change readline.c to use more thread safe constructs where 
available (arrange so that the longjmp out of the signal handler
is only executed for the thread that is using readline, and use 
siglongjmp if available)

According to the book "Programming with Threads" by
Kleinman, Shah, and Smaalders:

"Asynchronously generated signals are sent to the process as
a whole where they may be serviced by any thread that has
the signal unmasked.  If more than one thread is able to
receive a signal sent to the process, only one is chosen."

If we leave SIGINT unmasked on all threads, then the signal
handler will need to check the thread ID, and if not the
main thread, use pthread_kill(main_thread, SIGINT) to defer
the work to the main thread. In that sense, it'd be simpler
to block SIGINT in all threads and force the system to route
the SIGINT to the main thread directly.  Of course if a
particular threads implementation doesn't have the desired
asynchronous signal routing behavior, maybe leaving SIGINT
unmasked and using the pthread_kill(main_thread, SIGINT)
technique could work around that.

So to sum up, I'm in complete agreement with unblocking most
if not all signals in other threads and with langmead's
proposals to leverage the benefits provided by sigaction()
and siglongjmp() when possible.  I have one question though.
 Would it be reasonable to force SIGINT to be handled only
by the main thread, or is there a need for Python threads
other than the main thread to handle/receive SIGINT?  If the
latter then the setjmp()/longjmp() mechanism currently used
in the readline module is going to be problematic.
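
A minimal Python 3 sketch of the "block only SIGINT in spawned threads"
idea discussed above. It is not the patch under discussion: it assumes a
Unix platform with signal.pthread_sigmask (Python 3.3 or later), and the
helper names start_worker_without_sigint and record_mask are
hypothetical. It only illustrates that a new thread inherits the mask of
the thread that creates it, so blocking SIGINT around thread creation
leaves the creating thread as the one with SIGINT unblocked.

```python
import signal
import threading

def start_worker_without_sigint(target, *args):
    """Start a thread whose inherited signal mask additionally blocks SIGINT."""
    # Block SIGINT in the creating thread just long enough to spawn the worker.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        worker = threading.Thread(target=target, args=args)
        worker.start()  # the new thread inherits the creator's current mask
    finally:
        # Restore the creator's original mask so it can receive SIGINT again.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return worker

def record_mask(out):
    # SIG_BLOCK with an empty set changes nothing and returns the current mask.
    out["mask"] = signal.pthread_sigmask(signal.SIG_BLOCK, [])

# Mask of the creating thread before any worker is started.
baseline_mask = signal.pthread_sigmask(signal.SIG_BLOCK, [])

seen = {}
start_worker_without_sigint(record_mask, seen).join()
```

After this runs, seen["mask"] should be the creator's baseline mask plus
SIGINT, while the creator's own mask is unchanged. Other signals such as
SIGSEGV and SIGTERM are left as they were, which is the point of the
proposal.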

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2004-05-11 09:59

Message:
Logged In: YES 
user_id=6380

I'm beginning to think that langmead may be on to something:
that blocking all signals in all threads is Just Plain Wrong
(tm). The Zope SIGSEGV problem is just an example; I have my
own beef with SIGTERM, which ends up blocked (together with
all other signals) in child processes started from a thread.

I would love to see langmead's patch! (For Python 2.4.)

Make sure to test the behavior from 465673 (and possibly
219772?) after removing the signal blocking but before
adding the new fixes, and again after applying those, to
make sure 465673 is really gone.

Also, I'd like to hear from jasonlowe, who submitted bug
465673 and the patch that caused all the problems, 468347.
Maybe his signal-fu has increased since 2001.

It would be a miracle to get this into 2.3.4 though...

----------------------------------------------------------------------

Comment By: Michael Hudson (mwh)
Date: 2004-05-11 04:52

Message:
Logged In: YES 
user_id=6656

That does indeed sound reasonable, but not for 2.3.4
(professional cowardice, I'm afraid).

Good luck!

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-10 18:46

Message:
Logged In: YES 
user_id=119306

The original bug that led to the signal blocking, #465673, seems 
to be exposing itself via a combination of threads and readline. Is 
it possible that the problem is there and not within the signal 
handling code itself? (Especially since it installs and removes a 
SIGINT handler, possibly causing a race condition with the code 
within the signal handler when it re-installs itself. On systems that 
have sigaction, should Python need to re-install handlers at all?)

I'm tempted to try the following, and if it works submit a patch.  
Does this seem like it would be the right direction?

* Remove the section of thread creation that blocks signals.

* Look for sections of code that may have reentrancy issues, like:

**  On machines with reliable signals, keep the signal handler 
installed, rather than reinstalling it within the handler.

** Change Py_AddPendingCall to use a real semaphore, if 
available, rather than a busy flag.

** Change readline.c to use more thread safe constructs where 
available (arrange so that the longjmp out of the signal handler is 
only executed for the thread that is using readline, and use 
siglongjmp if available)

and then see if issues like this one are solved without 
reintroducing issues from 465673.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-05-10 16:06

Message:
Logged In: YES 
user_id=31435

Unassigned (was assigned to Guido, but doesn't sound like 
he's going to do more with it).

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2004-05-10 15:47

Message:
Logged In: YES 
user_id=6380

I agree with Anthony, too much risk for 2.3.4.

I don't claim to understand this code any more; in particular 
the signal blocking code that's currently there wasn't written 
by me and if I checked it in, I did it hoping for the best...

Langmead is right about signal asynchrony.

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-10 14:45

Message:
Logged In: YES 
user_id=119306

Unfortunately, in pthreads the "synchronous" doesn't apply to a 
signal number, but to its method of delivery. You can deliver a 
"SIGSEGV" asynchronously with the "kill" command, and you can send 
normally asynchronous signals to a specific thread with pthread_kill. What 
<http://sourceforge.net/tracker/?func=detail&aid=949332&group_id=5470&atid=305470> 
does is unblock signals like SIGSEGV which are likely to be sent 
synchronously from the OS and are unlikely to be handled by 
normal processes with asynchronous handlers.

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2004-05-10 12:08

Message:
Logged In: YES 
user_id=29957

I'd strongly prefer that this go into the trunk, and sooner,
rather than later. I'd even more strongly prefer that this
not go anywhere near the release23-maint branch, at least
until _after_ 2.3.4 is done. If there ends up being a nice
easy way to do this, great! We can cut a 2.3.5 around the
same time as 2.4 final. 

Putting this into 2.3.4 seems, to me, to be a hell of a risk.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-05-09 21:13

Message:
Logged In: YES 
user_id=31435

Whether you're off the hook depends on whether you're 
determined to be <wink>.  I don't run on Unixish systems, so 
I'm not a lot of use here.

The problem you've had with unkillable subprocesses also 
affects Zope, and you'll recall that the zdaemon business tries 
to keep Zope sites running via signal cruft.  As people have 
tried to move from Python 2.1 to Python 2.3, they're 
discovering that Zope sites fail hard because of the signal-
blocking added after 2.1:  "when an external python module 
segfaults during a zope request ... the remaining worker 
threads are deadlocked", from

http://tinyurl.com/2qslw

and zdaemon doesn't do its job then.

Andrew has in mind a scheme for not blocking "synchronous" 
signals, which makes sense to me, but I'm out of touch with 
this stuff.  If you can't review it, who can?  It would sure be 
nice to get a resolution into 2.3.4, although I understand that 
may be perceived as too risky.  The alternative from my 
immediate POV is that people running Zope-based apps on 
some Unixish systems stay away from Python 2.3, which is a 
real shame.  For that matter, unless this is resolved, I 
suppose they'll stay away from Python 2.4 too.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2004-05-07 17:26

Message:
Logged In: YES 
user_id=6380

(You had me confused for a bit -- I thought you meant Python
2.3, but you meant file revision 2.3, which was in 1994...)

It can't be as simple as that; the 1994 code (rev 2.3)
initializes both main_pid and main_thread, and checks one or
the other in different places. The NOTES in that version
don't shed much light on the issue except claiming that
checking getpid() is a hack that works on three platforms
named: SGI, Solaris, and POSIX threads.

The code that is re-initializing main_pid in
PyOS_AfterFork() was added much later (rev 2.30, in 1997).

Here's my theory.

On SGI IRIX, like on pre-2.6-kernel-Linux, getpid() differs
per thread, and SIGINT is sent to each thread. The getpid()
test here does the right thing: it only sets the flag once
(in the handler in the main thread).

On Solaris, the getpid() test is a no-op, which is fine since
only one thread gets the signal handler.

Those were the only two cases that the getpid() test really
cared for; the NOTES section speculated that it would also
work with POSIX threads if the signal was only delivered to
the main thread.

Conclusion: the getpid() test was *not* a mistake, and
replacing it with a get_thread_ident() test is not the right
answer.

But the getpid() test is probably not correct for all
pthreads implementations, and some fix may be necessary.

I also agree that blocking all signals is too aggressive,
but am not sure what to do about this either. (It has caused
some problems in my own code where I was spawning a
subprocess in a thread, and the subprocess inherited the
blocked signals, causing it to be unkillable except through
SIGKILL.)
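
A small Python 3 sketch of the subprocess problem described above,
assuming a Unix platform with signal.pthread_sigmask. The helper name
child_reports_sigterm_blocked and the CHILD_CODE one-liner are
hypothetical, and the code blocks only SIGTERM rather than every signal.
It relies on the POSIX rule that a child process starts with the signal
mask of the thread that created it.

```python
import signal
import subprocess
import sys
import threading

# The child prints whether SIGTERM is blocked in its own signal mask.
CHILD_CODE = (
    "import signal;"
    "print(signal.SIGTERM in signal.pthread_sigmask(signal.SIG_BLOCK, []))"
)

def child_reports_sigterm_blocked(block_first):
    """Spawn a child from a worker thread; return the child's own report."""
    result = {}

    def worker():
        if block_first:
            # Affects only this worker thread's mask, not the main thread's.
            signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
        done = subprocess.run(
            [sys.executable, "-c", CHILD_CODE],
            capture_output=True, text=True, check=True,
        )
        result["blocked"] = done.stdout.strip() == "True"

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return result["blocked"]

inherited = child_reports_sigterm_blocked(True)
```

If inherited is true, a plain "kill" (SIGTERM) sent to such a child stays
pending, which matches the "unkillable except through SIGKILL" symptom.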

Am I off the hook now?

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-05-07 12:50

Message:
Logged In: YES 
user_id=31435

Assigned to Guido to get an answer to one of the questions 
here:  Guido, signal_handler() checks getpid() against 
main_pid, and has ever since revision 2.3 (when you first 
taught signalmodule.c about threads).  But on every pthreads 
box except for Linux, getpid() should always equal main_pid 
(even after a fork).  What was the intent?  I read the 
comments the same as Andrew does here, that the intent 
was to check thread identity, not process identity.

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-07 09:59

Message:
Logged In: YES 
user_id=119306

mwh wrote: "when there's a modern, actually working implementation of 
pthreads, I don't think we actually need to block signals at all."

The bug report that caused the patch to be created was originally 
reported on Solaris, which has a more correct pthreads implementation. 
I'm now wondering if that problem was not caused by signals being 
handled by the spawned threads, but rather that the signal handler does 
a check for "if (getpid() == main_pid)" rather than 
"(PyThread_get_thread_ident() == main_thread)". On a 
standards-compliant pthreads implementation, and even on Solaris, 
getpid() will always "==" "main_pid".

For the Linux case, we may have a more modern working threads 
implementation now, but since the old LinuxThreads-style behavior was 
out and deployed for 8 years or so, it will probably be around for a 
while.
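
To illustrate the getpid() versus thread-identity distinction, here is a
short Python 3 sketch (the function name collect_identities is
hypothetical). On a standards-compliant pthreads implementation every
thread reports the same process ID, so only the thread identifier can
tell the main thread apart. On the old LinuxThreads implementation the
process IDs would have differed per thread instead.

```python
import os
import threading

def collect_identities(n_threads=3):
    """Return (pid, thread id) for the caller and for each worker thread."""
    caller = (os.getpid(), threading.get_ident())
    seen = []
    lock = threading.Lock()

    def worker():
        # Record what this thread sees as its process and thread identity.
        with lock:
            seen.append((os.getpid(), threading.get_ident()))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return caller, seen

main_identity, worker_identities = collect_identities()
```

Comparing the first element of each pair shows whether a getpid() test
can distinguish threads on a given platform; comparing the second shows
what a thread-identity test would see.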

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-07 09:48

Message:
Logged In: YES 
user_id=119306

There are two different thread related patches that I submitted.

I agree that 
<http://sourceforge.net/tracker/?func=detail&aid=948614&group_id=5470&atid=305470> 
is pretty radical.  (It's the one that tests at configure time for 
LinuxThreads peculiarities and alters the thread spawning and signal 
related activities accordingly.)

A different, related signal patch
<http://sourceforge.net/tracker/?func=detail&aid=949332&group_id=5470&atid=305470> 
might be more appealing to you. It only unblocks signals, like 
segmentation faults, that a thread synchronously generates for itself 
and that a pthreads implementation will always send to the faulting 
thread (whether it blocks them or not).

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2004-05-07 09:06

Message:
Logged In: YES 
user_id=29957

Any patches in this area, I'd prefer to see on the trunk,
along with tests to exercise it (and confirm that it's not
breaking something else). We can then give it a solid
testing during the 2.4 release cycle. 

I don't want to have to stretch the bugfix release cycle out
to have alphas, betas and the like. This seems like huge
piles of no-fun.

----------------------------------------------------------------------

Comment By: Michael Hudson (mwh)
Date: 2004-05-07 08:56

Message:
Logged In: YES 
user_id=6656

Note that there is an attempt at a configure test in 948614,
but it seems very LinuxThreads specific.

I agree with Anthony that this area is very scary.  The last
thing we want to do a fortnight before release is break
things somewhere they currently work.

On the gripping hand, when there's a modern, actually
working implementation of pthreads, I don't think we
actually need to block signals at all.  I certainly don't
have the threads-fu to come up with appropriate
configure/pyport.h magic though.  I'm not sure I have the
energy to test a patch on all the testdrive, snake farm and
SF compile farm machines either.

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2004-05-07 08:39

Message:
Logged In: YES 
user_id=29957

We're a week out from release-candidate, and this seems (to
me) to be an area that's fraught with risk. The terms
"HP/UX" and "threads" have also cropped up, which, for me,
is a marker of "here be sodding great big dragons". 

I don't mind delaying the release if it's necessary, and
there's a definite path to getting a nice clean fix in that
won't break things for some other class of platform. This
stuff looks like being a beast to test for, though. 

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-05-06 16:05

Message:
Logged In: YES 
user_id=31435

Boosting priority, hoping to attract interest before 2.3.4.  
Patch 949332 looks relevant.

----------------------------------------------------------------------

Comment By: Kjetil Jacobsen (kjetilja)
Date: 2004-05-05 04:28

Message:
Logged In: YES 
user_id=5685

I've experienced similar behaviour with hung threads on
other platforms such as HP/UX, so we should consider letting
through some signals to all threads on all platforms.

For instance, very few apps use signal handlers for SIGILL,
SIGFPE, SIGSEGV, SIGBUS and SIGABRT, so unblocking those
signals should not cause much breakage compared to the
breakage caused by blocking all signals.
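
The suggestion above can be sketched in Python 3 as a mask that blocks
everything except the five fault signals named. This assumes a Unix
platform with signal.pthread_sigmask and signal.valid_signals (Python
3.8 or later); the names FAULT_SIGNALS, worker_mask and apply_and_report
are hypothetical, and the real change would live in the C thread
creation code rather than in Python.

```python
import signal
import threading

# The signals named above that applications rarely handle themselves.
FAULT_SIGNALS = {
    signal.SIGILL, signal.SIGFPE, signal.SIGSEGV,
    signal.SIGBUS, signal.SIGABRT,
}

def worker_mask():
    """Every signal the platform accepts, minus the fault signals."""
    return signal.valid_signals() - FAULT_SIGNALS

def apply_and_report(out):
    # SIG_SETMASK replaces the calling thread's mask only. Requests to
    # block SIGKILL or SIGSTOP are silently ignored by the system.
    signal.pthread_sigmask(signal.SIG_SETMASK, worker_mask())
    out["mask"] = signal.pthread_sigmask(signal.SIG_BLOCK, [])

report = {}
worker = threading.Thread(target=apply_and_report, args=(report,))
worker.start()
worker.join()
```

The worker ends up with SIGINT and SIGTERM blocked but SIGSEGV and the
other fault signals deliverable, so a crash in that thread would still
get its default action.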

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-05-04 10:44

Message:
Logged In: YES 
user_id=31435

Noting that this has become a semi-frequent topic on the 
zope-dev mailing list, most recently in the "Segfault and 
Deadlock" thread starting here:

<http://mail.zope.org/pipermail/zope-dev/2004-May/022813.html>

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-04 10:00

Message:
Logged In: YES 
user_id=119306

The issue is that the threading implementation in Linux kernels 
previous to 2.6 diverged from the pthreads standard for signal 
handling. Normally signals are sent to the process and can be 
handled by any thread. In the LinuxThreads implementation of 
pthreads, signals are sent to a specific thread. If that thread 
blocks signals (which is what happens to all threads spawned in 
Python 2.2) then those signals do not get routed to a thread with 
them unblocked (what Python calls the "main thread")

The new threading facility in Linux 2.6, the NPTL, does not have 
this signal handling bug.

A simple python script that shows the problem is included below. 
This will hang in Linux kernels before 2.6 or RedHat customized 
kernels before RH9.

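#!/usr/bin/python
# Python 2 script: uses the low-level "thread" module.

import os
import signal
import thread

def handle_signals(sig, frame):
    # No-op handler; it only needs to exist so signal.pause() can return.
    pass

def send_signals():
    # Runs in the spawned thread and sends SIGSEGV to os.getpid().
    os.kill(os.getpid(), signal.SIGSEGV)

signal.signal(signal.SIGSEGV, handle_signals)
thread.start_new_thread(send_signals, ())
# On affected kernels the signal never reaches the main thread, so this hangs.
signal.pause()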

----------------------------------------------------------------------

Comment By: Greg Jones (morngnstar)
Date: 2003-06-18 19:54

Message:
Logged In: YES 
user_id=554883

Related to Bug #756940.

----------------------------------------------------------------------
