[ python-Bugs-756924 ] SIGSEGV causes hung threads (Linux)

Thu May 20 14:38:06 EDT 2004

Bugs item #756924, was opened at 2003-06-18 19:28
Message generated for change (Comment added) made by langmead
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=756924&group_id=5470

Category: Threads
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Greg Jones (morngnstar)
Assigned to: Nobody/Anonymous (nobody)
Summary: SIGSEGV causes hung threads (Linux)

Initial Comment:
When a segmentation fault happens on Linux in any 
thread but the main thread, the program exits, but 
zombie threads remain behind.

Steps to reproduce:
1. Download attached tar and extract files zombie.py 
and zombieCmodule.c.
2. Compile and link zombieCmodule.c as a shared library 
(or whatever other method you prefer for making a 
Python extension module).
3. Put the output from step 2 (zombieC.so) in your 
lib/python directory.
4. Run python2.2 zombie.py.
5. After the program exits, run ps.

zombie.py launches several threads that just loop 
forever, and one that calls a C function in zombieC. The 
latter prints "NULL!" then segfaults intentionally, 
printing "Segmentation fault". Then the program exits, 
returning control back to the shell.

Expected, and Python 2.1 behavior:
No Python threads appear in the output of ps.

Actual Python 2.2 behavior:
5 Python threads appear in the output of ps. To kill 
them, you have to apply kill -9 to each one individually.

  Not only does this bug leave around messy zombie 
threads, but the threads left behind hold on to program 
resources. For example, if the program binds a socket, 
that port cannot be bound again until someone kills the 
threads. Of course programs should not generate 
segfaults, but if they do they should fail gracefully.

  I have identified the cause of this bug. The old Python 
2.1 behavior can be restored by removing these lines of 
Python/thread_pthread.h:

sigfillset(&newmask);
SET_THREAD_SIGMASK(SIG_BLOCK, &newmask, &oldmask);

... and ...

SET_THREAD_SIGMASK(SIG_SETMASK, &oldmask, NULL);

  I guess even SIGSEGV gets blocked by this code, and 
somehow that prevents the default behavior of segfaults 
from working correctly.

  I'm not suggesting that removing this code is a good 
way to fix this bug. This is just an example to show that 
it seems to be the blocking of signals that causes this 
bug.
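
For illustration only, here is a rough Python 3 analogue of that block/create/restore pattern. It is not the thread_pthread.h code; the helper name is made up, and signal.valid_signals() stands in for sigfillset(). It only shows that a thread created while everything is blocked starts life with SIGSEGV (and SIGTERM) blocked.

```python
import signal
import threading

def mask_inherited_by_new_thread(to_block):
    """Block `to_block` around thread creation (hypothetical helper) and
    return the set of signals the new thread finds blocked."""
    seen = []

    def worker():
        # SIG_BLOCK with an empty set changes nothing and returns the current mask.
        seen.append(signal.pthread_sigmask(signal.SIG_BLOCK, []))

    # Analogue of sigfillset() + SET_THREAD_SIGMASK(SIG_BLOCK, ...).
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, to_block)
    try:
        t = threading.Thread(target=worker)
        t.start()  # the new thread inherits the creator's current mask
    finally:
        # Analogue of SET_THREAD_SIGMASK(SIG_SETMASK, &oldmask, NULL).
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    t.join()
    return seen[0]

thread_mask = mask_inherited_by_new_thread(signal.valid_signals())
print("SIGSEGV blocked in new thread:", signal.SIGSEGV in thread_mask)
print("SIGTERM blocked in new thread:", signal.SIGTERM in thread_mask)
```

The creating thread gets its old mask back, but the spawned thread keeps the full mask for its whole lifetime.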

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-20 14:38

Message:
Logged In: YES 
user_id=119306

The callback interface seems to have been  added in readline 2.1, 
from 1997. There seem to be configure tests in the current 
Modules/readline.c code to search for features new to readline 2.1  
so my current approach would be upping the minimum readline 
version from 2.0 to 2.1. If needed I could test for the callback 
interface and use it if available, but fall back to the readline() 
interface otherwise (and leave the thread and signal handling 
issues in place when used with older readline.)

----------------------------------------------------------------------

Comment By: Michael Hudson (mwh)
Date: 2004-05-20 13:02

Message:
Logged In: YES 
user_id=6656

This sounds cool!  The only thing to be aware of is readline
versioning... are these alternate interfaces a recent thing?

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-20 08:16

Message:
Logged In: YES 
user_id=119306

I have an approach to have readline work well with threads while still 
acknowledging KeyboardInterrupt. Using the alternate readline interface 
of rl_callback_handler_install() and rl_callback_read_char() along with a 
select(), we can recognize the interrupt signal when select() fails with EINTR 
and not need the signal handler at all. I just need to try my patch on a 
few more systems and try to put Anthony at ease.
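
A small Python 3 sketch of just the select() half of this idea, not the actual patch. The readline callback functions are not reachable from Python, so the sketch waits on a plain pipe. SIGALRM stands in for SIGINT, and the names Interrupted and wait_for_input are made up for the example. One difference from C: since PEP 475, Python retries select() after EINTR unless the signal handler raises, so here the handler raises an exception instead of the caller checking errno.

```python
import os
import select
import signal

class Interrupted(Exception):
    """Raised by the handler to abandon a blocking select() (hypothetical name)."""

def _on_alarm(signum, frame):
    raise Interrupted

def wait_for_input(fd, interrupt_after, timeout):
    """Wait for `fd` to become readable; report how the wait ended."""
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    try:
        # Arrange for a signal to arrive while we are blocked in select().
        signal.setitimer(signal.ITIMER_REAL, interrupt_after)
        readable, _, _ = select.select([fd], [], [], timeout)
        return "input" if readable else "timeout"
    except Interrupted:
        # The signal cut the wait short; no longjmp() out of a handler needed.
        return "interrupted"
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)
        if old_handler is not None:
            signal.signal(signal.SIGALRM, old_handler)

read_end, write_end = os.pipe()
outcome = wait_for_input(read_end, interrupt_after=0.05, timeout=5.0)
os.close(read_end)
os.close(write_end)
print(outcome)
```

This has to run in the main thread, since Python only allows signal.signal() there.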

----------------------------------------------------------------------

Comment By: Jason Lowe (jasonlowe)
Date: 2004-05-13 13:25

Message:
Logged In: YES 
user_id=56897

There didn't seem to be an outcome from the python-dev
discussion regarding system() and pthread_atfork().  The
thread indicates that system() indeed is supposed to call
atfork handlers, so therefore RedHat 9 is violating the
pthread standard in that sense.  (Whether or not they'll fix
it is another issue.)  There's also mention that os.system()
may be changed to not call system() because of the atfork()
problem.  If the changes to avoid system() are implemented,
would the pthread_atfork() approach still be problematic?

As Martin Loewis points out, we could always implement the
signal fixup in the child directly after the fork() if
Python routines are being used to do the fork() in the first
place.  However if we're concerned about native modules that
directly call fork() then it seems our choices are a
pthread_atfork() approach or an approach where SIGINT isn't
blocked.  Without an async-signal-safe way to route a signal
from one thread to another, I don't see how we can leave
SIGINT unblocked in all threads.

Re: Py_AddPendingCall. That approach might work in many
cases, but I assume it doesn't work well when all threads
are currently busy in native modules that are not
well-behaved.  For example, I have two threads: one in
readline() and the other blocked in a native call that, like
readline(), doesn't return control on EINTR.  If the SIGINT
is sent to the readline thread, the signal handler could
check the thread ID and do the longjmp() since we're the
proper thread to do so.  If the SIGINT is sent to the other
thread, the callback added by Py_AddPendingCall() won't
necessarily be processed any time soon because no threads
are going to return control (in a timely manner) to Python.
 To make matters worse, apparently even something as simple
as pthread_self(), which we'd use to get the thread ID,
isn't async-signal-safe on all platforms.  From what I've
read, none of the pthread functions are guaranteed to be
async-signal-safe.  :-(

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-13 12:48

Message:
Logged In: YES 
user_id=119306

pthread_kill(). That is annoying, I have something nearly done 
that used it. I didn't double check the safety of pthread_kill. I saw 
that posix says that kill is safe to call from interrupt handlers and 
went from there.

Can we note that we need a pthread_kill in a call to 
Py_AddPendingCall, and then handle it later?

----------------------------------------------------------------------

Comment By: Michael Hudson (mwh)
Date: 2004-05-13 12:25

Message:
Logged In: YES 
user_id=6656

Just to make life more entertaining, pthread_atfork isn't
what you want, either.

http://mail.python.org/pipermail/python-dev/2003-December/041309.html

----------------------------------------------------------------------

Comment By: Jason Lowe (jasonlowe)
Date: 2004-05-13 12:20

Message:
Logged In: YES 
user_id=56897

Argh!  I thought we had a relatively clean solution, but it
appears there's a stumbling block with the pthread_kill()
approach.  pthread_kill() is not guaranteed to be
async-signal-safe, and nuggets found on the net indicate
there's no portable way to redirect a process signal from
one thread to another:

http://groups.google.com/groups?q=pthread_kill+async+safe&start=30&hl=en&lr=&selm=3662B6A8.861984D5%40zko.dec.com&rnum=32

http://www.redhat.com/archives/phil-list/2003-December/msg00049.html

Given that we can't safely call pthread_kill() from the
SIGINT handler directly, there might be another way to solve
our problems with pthread_atfork().  Here's my thinking:

- Block SIGINT in all threads except the main (readline) thread.

- Register a child process handler via pthread_atfork()
that sets the SIGINT action for the child process back to
the default.

Unfortunately this fix isn't localized to the readline
module as desired, but it may solve the problems.  SIGINT
routing will be forced to the readline thread, and child
processes won't have SIGINT blocked, solving bug 756940. 
The IRIX thread signal delivery model (i.e.: broadcast) may
cause problems since SIGINT may be pending when we attempt
to set the action to default.  Having SIGINT pending when
the handler is changed to default would kill the child
process.  Maybe having the child process set the disposition
to ignore and then to default would safely clear any pending
SIGINT signal?
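
On the ignore-then-default question: POSIX says that setting the action of a pending signal to SIG_IGN discards the pending instance, whether or not it is blocked. A Python 3.8+ sketch that checks this on the local system follows. SIGUSR1 is used as a stand-in for SIGINT (an assumption, to leave Ctrl-C handling alone), and this runs in one process rather than in a forked child.

```python
import signal

SIG = signal.SIGUSR1  # stand-in for SIGINT (assumption for this sketch)

def is_pending():
    return SIG in signal.sigpending()

old_handler = signal.signal(SIG, signal.SIG_DFL)
old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {SIG})
try:
    signal.raise_signal(SIG)            # blocked, so it stays pending
    pending_before = is_pending()
    signal.signal(SIG, signal.SIG_IGN)  # ignoring discards the pending instance
    pending_after_ignore = is_pending()
    signal.signal(SIG, signal.SIG_DFL)  # back to default, nothing left to deliver
finally:
    # Ignore once more before unblocking, so a leftover pending signal
    # cannot kill the process if something above failed part-way.
    signal.signal(SIG, signal.SIG_IGN)
    signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    signal.signal(SIG, old_handler if old_handler is not None else signal.SIG_DFL)

print("pending before SIG_IGN:", pending_before)
print("pending after SIG_IGN: ", pending_after_ignore)
```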

I'll try to run some experiments with the pthread_atfork()
approach soon, but work and home life for me is particularly
busy lately.  Apologies in advance if it takes me a while to
respond or submit patches.

If we're interested in a timely fix, would it be useful to
break up the fix in two stages?  I think we can all agree
that the current approach of blocking ALL signals in created
threads is a Bad Thing.  What if we implement a quick,
partial fix by simply changing the existing code to only block
SIGINT?  This should be a two-line change to
thread_pthread.h where "sigemptyset(&newmask);
sigaddset(&newmask, SIGINT);" is used instead of
"sigfillset(&newmask);".  I see this partial fix having a
number of benefits:

- Easy change to make.  No extra stuff to check for in
configure or calls to things that may not exist or work
properly.

- Much less risky than trying to fix all the problems at
once.  The change only opens up signals to threads that
Python-2.1 is already allowing through.

- Should solve the SIGSEGV zombie problem and Guido's
SIGTERM annoyance, although it would still have the problem
reported in bug 756940.
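
To make the effect of the partial fix concrete, here is a Python 3 sketch of the same idea (not the thread_pthread.h change itself; the helper names are made up). Only SIGINT is blocked while the thread is created, so the new thread's mask differs from its creator's by at most SIGINT.

```python
import signal
import threading

def current_mask():
    # SIG_BLOCK with an empty set just reports the calling thread's mask.
    return signal.pthread_sigmask(signal.SIG_BLOCK, [])

def start_thread_blocking_only_sigint(target):
    """Hypothetical helper: analogue of sigemptyset() + sigaddset(SIGINT)
    around thread creation, followed by restoring the old mask."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        thread = threading.Thread(target=target)
        thread.start()
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return thread

creator_mask = current_mask()
worker_masks = []
worker = start_thread_blocking_only_sigint(
    lambda: worker_masks.append(current_mask()))
worker.join()

# Signals blocked in the worker that were not already blocked in the creator.
added_by_creation = worker_masks[0] - creator_mask
print("extra signals blocked in worker:", added_by_creation)
```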

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2004-05-12 10:39

Message:
Logged In: YES 
user_id=29957

This seems like a pragmatic and sensible approach to take,
to me. It should probably be tested on the HP/UX boxes
(google for 'HP/UX testdrive')

I particularly like the idea of just putting a test in to
block readline in the non-main thread. It seems the pythonic
approach - since we can't guarantee behaviour that's
anything but sane, it seems like a plan. Or at least make it
issue a warning saying "don't do this" when readline is
invoked from a non-main thread.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2004-05-12 10:11

Message:
Logged In: YES 
user_id=6380

Sounds good.  This solves the problem in the readline
module, where it originates.

BTW, if we can simplify things by only allowing readline()
to be called from the main thread, that's fine with me.
Doing console I/O from threads is insane anyway.

We can start by assuming the signal broadcast problem is
restricted to IRIX, and configure appropriately: define a
test symbol for this and in configure, set this when IRIX is
detected.

----------------------------------------------------------------------

Comment By: Jason Lowe (jasonlowe)
Date: 2004-05-12 09:54

Message:
Logged In: YES 
user_id=56897

SIGINT is 'special' because that's the signal behind the
problems reported in bug 465673.  Given the readline
module's setjmp/longjmp mechanism to process SIGINT, we
simply cannot allow one thread to do the setjmp() and
another thread to do the longjmp() when it receives SIGINT.
 Without the setjmp/longjmp stuff, SIGINT is no more special
than any other asynchronous signal like SIGTERM, SIGUSR1,
etc.  It'd be great if we could get the desired behavior for
SIGINT out of the readline module without setjmp/longjmp,
but without help from the readline library I don't see an
easy way to do this.  The readline library insists on
continuing the readline() call after a SIGINT is handled,
and there doesn't appear to be any way to get it to abort
the current readline() call short of modifying the readline
library.

If we're stuck with the setjmp/longjmp mechanism, I think we
can solve the issues regarding readline() being called from
another thread and exec'd processes from threads by using
the pthread_kill() technique mentioned earlier.  The steps
would look something like this:

- Do not block any signals (including SIGINT) in any threads.

- When we initialize the readline module's jmp_buf via
setjmp(), save off the current thread ID.  Probably want to
check for existing ownership of jmp_buf and flag an error if
detected.

- When the readline module SIGINT handler is invoked, check
if the current thread owns jmp_buf.  If we are the owning
thread then execute the longjmp (or siglongjmp).  If we're
not the owning thread, then have the current thread execute
pthread_kill(jmp_buf_owner_thread, SIGINT) and little else.
 This will defer the SIGINT to the only thread that can
really process it correctly.

- Since SIGINT isn't blocked in any thread, processes exec'd
from threads should get the default behavior for SIGINT
rather than having it blocked.
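
A Python 3 sketch of the forwarding step only, under some loud assumptions. Python-level signal handlers always run in the main thread, so the "am I the owner?" check inside a C handler cannot be reproduced here. Instead the owner thread collects the signal with sigwait(), SIGUSR1 stands in for SIGINT, and pthread_kill() is called from ordinary code rather than from a handler (where, as noted elsewhere in this thread, it is not guaranteed to be async-signal-safe).

```python
import signal
import threading

SIG = signal.SIGUSR1  # stand-in for SIGINT (assumption for this sketch)
received = []

def owner():
    # Plays the role of the thread that owns the readline jmp_buf.
    # SIG is blocked here (mask inherited from the creator), so sigwait()
    # can pick it up synchronously, even if it was sent before we got here.
    received.append(signal.sigwait({SIG}))

old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {SIG})
try:
    owner_thread = threading.Thread(target=owner)
    owner_thread.start()
    # Forward the signal to the owning thread and nobody else.
    signal.pthread_kill(owner_thread.ident, SIG)
    owner_thread.join()
finally:
    signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)

print("owner thread received:", received)
```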

The above algorithm has a race condition on thread
implementations where all threads receive SIGINT.  The race
can cause SIGINT to be processed more than once.  The
jmp_buf owning thread might finish the processing of SIGINT
before another thread starts its processing and re-sends
SIGINT to the jmp_buf owning thread.  If there's a way to
know via configure that we're on a thread implementation
that broadcasts SIGINT, we could #ifdef the code to use
something like the getpid() hack in signalmodule.c to do the
right thing.

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-11 19:41

Message:
Logged In: YES 
user_id=119306

The only thing special about SIGINT is that the readline module uses 
PyOS_setsig to set it, and when readline's special SIGINT handler is set, 
it throws all of the careful thread handling in Modules/signalmodule.c:
signal_handler out the window.

Now that I say it out loud, PyOS_setsig deserves some consideration on its own. 

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2004-05-11 19:04

Message:
Logged In: YES 
user_id=6380

And I think it is possible to call readline() from any
thread. (Though it would be a problem if multiple threads
were doing this simultaneously :-)

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2004-05-11 18:54

Message:
Logged In: YES 
user_id=6380

But if you still block SIGINT (why is SIGINT special?) in
all threads, processes forked from threads will be started
with SIGINT blocked, and that's still wrong.
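
A quick way to see the inheritance, as a Python 3 sketch. For simplicity it blocks SIGINT in the calling thread rather than in a spawned thread, and the helper name is made up. The child process simply reports whether SIGINT is in its own mask.

```python
import signal
import subprocess
import sys

# The child prints True if it starts with SIGINT blocked.
CHILD_CODE = (
    "import signal;"
    "print(signal.SIGINT in signal.pthread_sigmask(signal.SIG_BLOCK, []))"
)

def child_sees_sigint_blocked():
    """Hypothetical helper: start a child while SIGINT is blocked here."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_CODE],
            capture_output=True, text=True, check=True,
        )
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return result.stdout.strip() == "True"

blocked_in_child = child_sees_sigint_blocked()
print("child started with SIGINT blocked:", blocked_in_child)
```

The signal mask survives both fork() and exec(), which is why the child cannot be interrupted with Ctrl-C unless it unblocks SIGINT itself.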

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-11 18:46

Message:
Logged In: YES 
user_id=119306

I was handwaving a bit over the "arrangements" to make with the 
siglongjmp. It is probable that blocking SIGINT from all spawned 
threads will be the easiest. It will also work in both the pthreads and 
LWP case (signal sent to one unblocked thread in the process) and the 
LinuxThreads and SGI threads case (signal broadcast to the process 
group, which includes each thread individually.)  The only thing I wanted 
to double check was whether readline could be executed by any thread 
other than the main thread. If so, the SIGINT handler needs to check not 
whether it is the main thread, but rather if it is the (or *a*?)  thread that 
currently is in the middle of a readline call.

----------------------------------------------------------------------

Comment By: Jason Lowe (jasonlowe)
Date: 2004-05-11 15:57

Message:
Logged In: YES 
user_id=56897

Ack.  I thought I was logged in for that previous comment
which was from me (jasonlowe).

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2004-05-11 15:52

Message:
Logged In: NO 

I agree, the original patch I submitted is horribly
ham-fisted because it blocks all signals.  I'm kicking
myself for not foreseeing the problems with SIGSEGV, SIGTERM,
etc.  as reported in 756924.

The original problem I was trying to fix was that the wrong
thread (i.e.: any thread but the main thread) would receive
the SIGINT and end up doing the longjmp() to the context
saved by the main thread.  Then we have two threads
executing on the main thread's stack which is a Bad Thing. 
With the way the readline support currently handles SIGINT
via setjmp()/longjmp(), you really want the main thread and
only the main thread to get the SIGINT and perform that
longjmp().

Would it be reasonable to block only SIGINT (and not other
signals) for all threads but the main thread?  That would
force SIGINT to be handled by the main thread and eliminate
the worry that the wrong thread will do the longjmp() into
the main thread's context in the readline code.

I agree with large parts of langmead's proposed approach to
fixing this, but I do have concerns about the combination of
these two parts:

* Remove the section of thread creation that blocks signals.

* Change readline.c to use more thread safe constructs where 
available (arrange so that the longjmp out of the signal handler
is only executed for the thread that is using readline, and use 
siglongjmp if available)

According to the book "Programming with Threads" by
Kleinman, Shah, and Smaalders:

"Asynchronously generated signals are sent to the process as
a whole where they may be serviced by any thread that has
the signal unmasked.  If more than one thread is able to
receive a signal sent to the process, only one is chosen."

If we leave SIGINT unmasked on all threads, then the signal
handler will need to check the thread ID, and if not the
main thread, use pthread_kill(main_thread, SIGINT) to defer
the work to the main thread. In that sense, it'd be simpler
to block SIGINT in all threads and force the system to route
the SIGINT to the main thread directly.  Of course if a
particular threads implementation doesn't have the desired
asynchronous signal routing behavior, maybe leaving SIGINT
unmasked and using the pthread_kill(main_thread, SIGINT)
technique could work around that.

So to sum up, I'm in complete agreement with unblocking most
if not all signals in other threads and with langmead's
proposals to leverage the benefits provided by sigaction()
and siglongjmp() when possible.  I have one question though.
 Would it be reasonable to force SIGINT to be handled only
by the main thread, or is there a need for Python threads
other than the main thread to handle/receive SIGINT?  If the
latter then the setjmp()/longjmp() mechanism currently used
in the readline module is going to be problematic.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2004-05-11 09:59

Message:
Logged In: YES 
user_id=6380

I'm beginning to think that langmead may be on to something:
that blocking all signals in all threads is Just Plain Wrong
(tm). The Zope SIGSEGV problem is just an example; I have my
own beef with SIGTERM, which ends up blocked (together with
all other signals) in child processes started from a thread.

I would love to see langmead's patch! (For Python 2.4.)

Make sure to test the behavior from 465673 (and possibly
219772?) after removing the signal blocking but before
adding the new fixes, and again after applying those, to
make sure 465673 is really gone.

Also, I'd like to hear from jasonlowe, who submitted bug
465673 and the patch that caused all the problems, 468347.
Maybe his signal-fu has increased since 2001.

It would be a miracle to get this into 2.3.4 though...

----------------------------------------------------------------------

Comment By: Michael Hudson (mwh)
Date: 2004-05-11 04:52

Message:
Logged In: YES 
user_id=6656

That does indeed sound reasonable, but not for 2.3.4
(professional cowardice, I'm afraid).

Good luck!

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-10 18:46

Message:
Logged In: YES 
user_id=119306

The original bug that added the signal blocking, #465673, seems 
to be exposing itself via a combination of  threads and readline. Is 
it possible that the problem is there and not within the signal 
handling code itself? (especially since it installs and removes a 
SIGINT handler, possibly causing a race condition with the code 
within the signal handler when it re-installs itself. On systems that 
have sigaction, should python need to re-install handlers at all? )

I'm tempted to try the following, and if it works submit a patch.  
Does this seem like it would be the right direction?

* Remove the section of thread creation that blocks signals.

* Look for sections of code that may have reentrancy issues, like:

**  On machines with reliable signals, keep the signal handler 
installed, rather than reinstalling it within the handler.

** Change Py_AddPendingCall to use a real semaphore, if 
available, rather than a busy flag.

** Change readline.c to use more thread safe constructs where 
available (arrange so that the longjmp out of the signal handler is 
only executed for the thread that is using readline, and use 
siglongjmp if available)

and then see if issues like this one are solved without 
reintroducing issues from 465673.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-05-10 16:06

Message:
Logged In: YES 
user_id=31435

Unassigned (was assigned to Guido, but doesn't sound like 
he's going to do more with it).

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2004-05-10 15:47

Message:
Logged In: YES 
user_id=6380

I agree with Anthony, too much risk for 2.3.4.

I don't claim to understand this code any more; in particular 
the signal blocking code that's currently there wasn't written 
by me and if I checked it in, I did it hoping for the best...

Langmead is right about signal asynchrony.

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-10 14:45

Message:
Logged In: YES 
user_id=119306

Unfortunately, in pthreads "synchronous" doesn't describe a 
signal number, but its method of delivery. You can deliver a 
"SIGSEGV" asynchronously with the "kill" command, and you can send 
normally asynchronous signals to a specific thread with pthread_kill. What
<http://sourceforge.net/tracker/?func=detail&aid=949332&group_id=5470&atid=305470>
does is unblock signals like SIGSEGV which are likely to be sent 
synchronously from the OS and are unlikely to be handled by 
normal processes as asynchronous handlers.

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2004-05-10 12:08

Message:
Logged In: YES 
user_id=29957

I'd strongly prefer that this go into the trunk, and sooner,
rather than later. I'd even more strongly prefer that this
not go anywhere near the release23-maint branch, at least
until _after_ 2.3.4 is done. If there ends up being a nice
easy way to do this, great! We can cut a 2.3.5 around the
same time as 2.4 final. 

Putting this into 2.3.4 seems, to me, to be a hell of a risk.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-05-09 21:13

Message:
Logged In: YES 
user_id=31435

Whether you're off the hook depends on whether you're 
determined to be <wink>.  I don't run on Unixish systems, so 
I'm not a lot of use here.

The problem you've had with unkillable subprocesses also 
affects Zope, and you'll recall that the zdaemon business tries 
to keep Zope sites running via signal cruft.  As people have 
tried to move from Python 2.1 to Python 2.3, they're 
discovering that Zope sites fail hard because of the signal-
blocking added after 2.1:  "when an external python module 
segfaults during a zope request ... the remaining worker 
threads are deadlocked", from

http://tinyurl.com/2qslw

and zdaemon doesn't do its job then.

Andrew has in mind a scheme for not blocking "synchronous" 
signals, which makes sense to me, but I'm out of touch with 
this stuff.  If you can't review it, who can?  It would sure be 
nice to get a resolution into 2.3.4, although I understand that 
may be perceived as too risky.  The alternative from my 
immediate POV is that people running Zope-based apps on 
some Unixish systems stay away from Python 2.3, which is a 
real shame.  For that matter, unless this is resolved, I 
suppose they'll stay away from Python 2.4 too.

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2004-05-07 17:26

Message:
Logged In: YES 
user_id=6380

(You had me confused for a bit -- I thought you meant Python
2.3, but you meant file revision 2.3, which was in 1994...)

It can't be as simple as that; the 1994 code (rev 2.3)
initializes both main_pid and main_thread, and checks one or
the other in different places. The NOTES in that version
don't shed much light on the issue except claiming that
checking getpid() is a hack that works on three platforms
named: SGI, Solaris, and POSIX threads.

The code that is re-initializing main_pid in
PyOS_AfterFork() was added much later (rev 2.30, in 1997).

Here's my theory.

On SGI IRIX, like on pre-2.6-kernel-Linux, getpid() differs
per thread, and SIGINT is sent to each thread. The getpid()
test here does the right thing: it only sets the flag once
(in the handler in the main thread).

On Solaris, the getpid() test is a no-op, which is fine since
only one thread gets the signal handler.

Those were the only two cases that the getpid() test really
cared for; the NOTES section speculated that it would also
work with POSIX threads if the signal was only delivered to
the main thread.

Conclusion: the getpid() test was *not* a mistake, and
replacing it with a get_thread_ident() test is not the right
answer.

But the getpid() test is probably not correct for all
pthreads implementations, and some fix may be necessary.

I also agree that blocking all signals is too aggressive,
but am not sure what to do about this either. (It has caused
some problems in my own code where I was spawning a
subprocess in a thread, and the subprocess inherited the
blocked signals, causing it to be unkillable except through
SIGKILL.)

Am I off the hook now?

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-05-07 12:50

Message:
Logged In: YES 
user_id=31435

Assigned to Guido to get an answer to one of the questions 
here:  Guido, signal_handler() checks getpid() against 
main_pid, and has ever since revision 2.3 (when you first 
taught signalmodule.c about threads).  But on every pthreads 
box except for Linux, getpid() should always equal main_pid 
(even after a fork).  What was the intent?  I read the 
comments the same as Andrew does here, that the intent 
was to check thread identity, not process identity.

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-07 09:59

Message:
Logged In: YES 
user_id=119306

mwh wrote: "when there's a modern, actually working implementation of 
pthreads, I don't think we actually need to block signals at all."

The bug report that caused the patch to be created was originally 
reported on Solaris, which has a more correct pthreads implementation. 
I'm now wondering if that problem was not caused by signals being 
handled by the spawned threads, but rather that the signal handler does 
a check for "if (getpid() == main_pid)" rather than 
"(PyThread_get_thread_ident() == main_thread)". On a 
standards-compliant pthreads implementation, and even on Solaris, getpid() will 
always "==" "main_pid".
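
A small Python 3 sketch of the difference between the two tests, assuming a standards-compliant threads implementation (for example NPTL). Under old LinuxThreads the pid comparison would come out differently. The variable names are only for illustration.

```python
import os
import threading

main_pid = os.getpid()
main_ident = threading.get_ident()
observations = []

def worker():
    # Record what each test would compare if it ran in this thread.
    observations.append((os.getpid(), threading.get_ident()))

threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# "getpid() == main_pid" is true in every thread, so it filters nothing.
pid_test_passes = [pid == main_pid for pid, _ in observations]
# A thread-identity comparison is true only in the main thread.
ident_test_passes = [ident == main_ident for _, ident in observations]

print("pid test passes in workers:  ", pid_test_passes)
print("ident test passes in workers:", ident_test_passes)
```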

For the Linux case, we may have a more modern working threads 
implementation now, but when the old LinuxThreads style behavior was 
out and deployed for 8 years or so, it will probably be around for a 
while.

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-07 09:48

Message:
Logged In: YES 
user_id=119306

There are two different thread-related patches that I submitted.

I agree that 
<http://sourceforge.net/tracker/?func=detail&aid=948614&group_id=5470&atid=305470>
is pretty radical. (It's the one that tests at configure time for LinuxThreads 
peculiarities and alters the thread spawning and signal related activities 
accordingly.)

A different related signal patch
<http://sourceforge.net/tracker/?func=detail&aid=949332&group_id=5470&atid=305470>
might be more appealing to you. It only unblocks signals like segmentation 
faults that a thread synchronously sends to itself and that a pthreads 
implementation will always send to the faulting thread (whether it blocks 
them or not).

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2004-05-07 09:06

Message:
Logged In: YES 
user_id=29957

Any patches in this area, I'd prefer to see on the trunk,
along with tests to exercise it (and confirm that it's not
breaking something else). We can then give it a solid
testing during the 2.4 release cycle. 

I don't want to have to stretch the bugfix release cycle out
to have alphas, betas and the like. This seems like huge
piles of no-fun.

----------------------------------------------------------------------

Comment By: Michael Hudson (mwh)
Date: 2004-05-07 08:56

Message:
Logged In: YES 
user_id=6656

Note that there is an attempt at a configure test in 948614,
but it seems very LinuxThreads specific.

I agree with Anthony that this area is very scary.  The last
thing we want to do a fortnight before release is break
things somewhere they currently work.

On the gripping hand, when there's a modern, actually
working implementation of pthreads, I don't think we
actually need to block signals at all.  I certainly don't
have the threads-fu to come up with appropriate
configure/pyport.h magic though.  I'm not sure I have the
energy to test a patch on all the testdrive, snake farm and
SF compile farm machines either.

----------------------------------------------------------------------

Comment By: Anthony Baxter (anthonybaxter)
Date: 2004-05-07 08:39

Message:
Logged In: YES 
user_id=29957

We're a week out from release-candidate, and this seems (to
me) to be an area that's fraught with risk. The terms
"HP/UX" and "threads" have also cropped up, which, for me,
is a marker of "here be sodding great big dragons". 

I don't mind delaying the release if it's necessary, and
there's a definite path to getting a nice clean fix in that
won't break things for some other class of platform. This
stuff looks like being a beast to test for, though. 

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-05-06 16:05

Message:
Logged In: YES 
user_id=31435

Boosting priority, hoping to attract interest before 2.3.4.  
Patch 949332 looks relevant.

----------------------------------------------------------------------

Comment By: Kjetil Jacobsen (kjetilja)
Date: 2004-05-05 04:28

Message:
Logged In: YES 
user_id=5685

I've experienced similar behaviour with hung threads on
other platforms such as HP/UX, so we should consider letting
through some signals to all threads on all platforms.

For instance, very few apps use signal handlers for SIGILL,
SIGFPE, SIGSEGV, SIGBUS and SIGABRT, so unblocking those
signals should not cause much breakage compared to the
breakage caused by blocking all signals.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2004-05-04 10:44

Message:
Logged In: YES 
user_id=31435

Noting that this has become a semi-frequent topic on the 
zope-dev mailing list, most recently in the "Segfault and 
Deadlock" thread starting here:

<http://mail.zope.org/pipermail/zope-dev/2004-May/022813.html>

----------------------------------------------------------------------

Comment By: Andrew Langmead (langmead)
Date: 2004-05-04 10:00

Message:
Logged In: YES 
user_id=119306

The issue is that the threading implementation in Linux kernels 
previous to 2.6 diverged from the pthreads standard for signal 
handling. Normally signals are sent to the process and can be 
handled by any thread. In the LinuxThreads implementation of 
pthreads, signals are sent to a specific thread. If that thread 
blocks signals (which is what happens to all threads spawned in 
Python 2.2) then those signals do not get routed to a thread with 
them unblocked (what Python calls the "main thread")

The new threading facility in Linux 2.6, the NPTL, does not have 
this signal handling bug.

A simple python script that shows the problem is included below. 
This will hang in Linux kernels before 2.6 or RedHat customized 
kernels before RH9.

#!/usr/bin/python

import signal
import thread
import os

def handle_signals(sig, frame): pass
def send_signals(): os.kill(os.getpid(), signal.SIGSEGV)

signal.signal(signal.SIGSEGV, handle_signals)
thread.start_new_thread(send_signals, ())
signal.pause()

----------------------------------------------------------------------

Comment By: Greg Jones (morngnstar)
Date: 2003-06-18 19:54

Message:
Logged In: YES 
user_id=554883

Related to Bug #756940.

----------------------------------------------------------------------
