Strange segfault in Python threads and linux kernel 2.6
G'day,
I've Cc'ed this to zope-coders as it might affect other Zope developers and it had me stumped for ages. I couldn't find anything on it anywhere, so I figured it would be good to get something into google :-).
We are developing a Zope2.7 application on Debian GNU/Linux that is using fop to generate pdf's from xml-fo data. fop is a java thing, and we are using popen2.Popen3(), non-blocking mode, and select loop to write/read stdin/stdout/stderr. This was all working fine.
Then over the Christmas chaos, various things on my development system were apt-get updated, and I noticed that java/fop had started segfaulting. I tried running fop with the exact same input data from the command line; it worked. I wrote a python script that invoked fop in exactly the same way as we were invoking it inside zope; it worked. It only segfaulted when invoked inside Zope.
I googled and tried everything... switched from j2re1.4 to kaffe, rolled back to a previous version of python, re-built Zope, upgraded Zope from 2.7.2 to 2.7.4, nothing helped. Then I went back from a linux 2.6.8 kernel to a 2.4.27 kernel; it worked!
After googling around, I found references to recent attempts to resolve some signal handling problems in Python threads. There was one post that mentioned subtle differences between how Linux 2.4 and Linux 2.6 did signals to threads.
So it seems this is a problem with Python threads and Linux kernel 2.6. The attached program demonstrates that it has nothing to do with Zope. Using it to run "fop-test /usr/bin/fop </dev/null" on a Debian box with fop installed will show the segfault. Running the same thing on a machine with 2.4 kernel will instead get the fop "usage" message. It is not a generic fop/java problem with 2.6 because the commented un-threaded line works fine. It doesn't seem to segfault for any command... "cat -" works OK, so it must be something about java contributing.
After searching the Python bugs, the closest I could find was #971213 <http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=971213>. Is this the same bug? Should I submit a new bug report? Is there any other way I can help resolve this?
BTW, built in file objects really could use better non-blocking support... I've got a half-drafted PEP for it... anyone interested in it?
-- 
Donovan Baarda <abo@minkirri.apana.org.au>
http://minkirri.apana.org.au/~abo/
Donovan Baarda <abo@minkirri.apana.org.au> writes:
G'day,
I've Cc'ed this to zope-coders as it might affect other Zope developers and it had me stumped for ages. I couldn't find anything on it anywhere, so I figured it would be good to get something into google :-).
We are developing a Zope2.7 application on Debian GNU/Linux that is using fop to generate pdf's from xml-fo data. fop is a java thing, and we are using popen2.Popen3(), non-blocking mode, and select loop to write/read stdin/stdout/stderr. This was all working fine.
Then over the Christmas chaos, various things on my development system were apt-get updated, and I noticed that java/fop had started segfaulting. I tried running fop with the exact same input data from the command line; it worked. I wrote a python script that invoked fop in exactly the same way as we were invoking it inside zope; it worked. It only segfaulted when invoked inside Zope.
I googled and tried everything... switched from j2re1.4 to kaffe, rolled back to a previous version of python, re-built Zope, upgraded Zope from 2.7.2 to 2.7.4, nothing helped. Then I went back from a linux 2.6.8 kernel to a 2.4.27 kernel; it worked!
After googling around, I found references to recent attempts to resolve some signal handling problems in Python threads. There was one post that mentioned subtle differences between how Linux 2.4 and Linux 2.6 did signals to threads.
You've left out a very important piece of information: which version of Python you are using. I'm guessing 2.3.4. Can you try 2.4?
So it seems this is a problem with Python threads and Linux kernel 2.6. The attached program demonstrates that it has nothing to do with Zope. Using it to run "fop-test /usr/bin/fop </dev/null" on a Debian box with fop installed will show the segfault. Running the same thing on a machine with 2.4 kernel will instead get the fop "usage" message. It is not a generic fop/java problem with 2.6 because the commented un-threaded line works fine. It doesn't seem to segfault for any command... "cat -" works OK, so it must be something about java contributing.
After searching the Python bugs, the closest I could find was #971213 <http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=971213>. Is this the same bug? Should I submit a new bug report? Is there any other way I can help resolve this?
I'd be astonished if this is the same bug. The main oddness about python threads (before 2.4) is that they run with all signals masked. You could play with a C wrapper (call sigprocmask, then exec fop) to see if this is what is causing the problem. But please try 2.4.
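The wrapper idea above can also be sketched in Python rather than C. This is only a sketch, assuming a POSIX system and Python 3.3 or later, where `signal.pthread_sigmask()` exists (the 2.3 and 2.4 interpreters discussed in this thread had no such call). The name `exec_with_empty_mask` is made up for illustration.

```python
import os
import signal


def exec_with_empty_mask(argv):
    """Unblock every signal in the calling thread, then exec argv.

    Hypothetical helper: on success it replaces the current process
    image and never returns; if the exec fails it raises OSError.
    """
    # SIG_SETMASK with an empty set leaves no signal blocked, so the
    # exec'ed program starts with a clean signal mask.
    signal.pthread_sigmask(signal.SIG_SETMASK, set())
    os.execvp(argv[0], argv)
```

A small script that calls `exec_with_empty_mask(sys.argv[1:])` could then be invoked in place of fop, with the real command passed as its arguments.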
BTW, built in file objects really could use better non-blocking support... I've got a half-drafted PEP for it... anyone interested in it?
Err, this probably should be in a different mail :)
Cheers,
mwh
-- 
If trees could scream, would we be so cavalier about cutting them down? We might, if they screamed all the time, for no good reason.
-- Jack Handey
On Wed, 2005-01-19 at 13:37 +0000, Michael Hudson wrote:
Donovan Baarda <abo@minkirri.apana.org.au> writes: [...] You've left out a very important piece of information: which version of Python you are using. I'm guessing 2.3.4. Can you try 2.4?
Debian Python2.3 (2.3.4-18), Debian kernel-image-2.6.8-1-686 (2.6.8-10), and Debian kernel-image-2.4.27-1-686 (2.4.27-6)
I'd be astonished if this is the same bug.
The main oddness about python threads (before 2.4) is that they run with all signals masked. You could play with a C wrapper (call sigprocmask, then exec fop) to see if this is what is causing the problem. But please try 2.4.
Python 2.4 does indeed fix the problem. Unfortunately we are using Zope 2.7.4, and I'm a bit wary of attempting to migrate it all from 2.3 to 2.4. Is there any way this "Fix" can be back-ported to 2.3?
Note that this problem is being triggered when using Popen3() in a thread. Popen3() simply uses os.fork() and os.execvp(). The segfault is occurring in the execvp'ed process. I'm sure there must be plenty of cases where this could happen. I think most people manage to avoid it because the processes they are popen'ing or exec'ing happen to not use signals.
After testing a bit, it seems the fork() in Popen3 is not a contributing factor. The problem occurs whenever os.execvp() is executed in a thread. It looks like the exec'ed command inherits the masked signals from the thread.
I'm not sure what the correct behaviour should be. The fact that it works in python2.4 feels more like a byproduct of the thread mask change than correct behaviour.
To me it seems like execvp() should be setting the signal mask back to defaults or at least the mask of the main process before doing the exec.
BTW, built in file objects really could use better non-blocking support... I've got a half-drafted PEP for it... anyone interested in it?
Err, this probably should be in a different mail :)
The verboseness of the attached test code because of this issue prompted that comment... so vaguely related :-)
-- 
Donovan Baarda <abo@minkirri.apana.org.au>
http://minkirri.apana.org.au/~abo/
Donovan Baarda <abo@minkirri.apana.org.au> writes:
On Wed, 2005-01-19 at 13:37 +0000, Michael Hudson wrote:
Donovan Baarda <abo@minkirri.apana.org.au> writes: [...] You've left out a very important piece of information: which version of Python you are using. I'm guessing 2.3.4. Can you try 2.4?
Debian Python2.3 (2.3.4-18), Debian kernel-image-2.6.8-1-686 (2.6.8-10), and Debian kernel-image-2.4.27-1-686 (2.4.27-6)
I'd be astonished if this is the same bug.
The main oddness about python threads (before 2.4) is that they run with all signals masked. You could play with a C wrapper (call sigprocmask, then exec fop) to see if this is what is causing the problem. But please try 2.4.
Python 2.4 does indeed fix the problem.
That's good to hear.
Unfortunately we are using Zope 2.7.4, and I'm a bit wary of attempting to migrate it all from 2.3 to 2.4.
That's not so good to hear, albeit unsurprising.
Is there any way this "Fix" can be back-ported to 2.3?
Probably not. It was quite invasive and a bit scary. OTOH, it hasn't been the cause of any bug reports yet, so it can't be all bad.
Note that this problem is being triggered when using Popen3() in a thread. Popen3() simply uses os.fork() and os.execvp(). The segfault is occurring in the execvp'ed process. I'm sure there must be plenty of cases where this could happen. I think most people manage to avoid it because the processes they are popen'ing or exec'ing happen to not use signals.
Indeed.
After testing a bit, it seems the fork() in Popen3 is not a contributing factor. The problem occurs whenever os.execvp() is executed in a thread. It looks like the exec'ed command inherits the masked signals from the thread.
Yeah. I could have told you that, sorry :)
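The inheritance described here can be observed directly on a current system. The sketch below assumes Python 3.7 or later on POSIX; since modern Python threads no longer start with signals masked, the worker thread blocks the signals itself to imitate the 2.3 behaviour. `REPORT_MASK` and `child_mask_from_thread` are illustrative names, not part of the attached test program.

```python
import signal
import subprocess
import sys
import threading

# Child program: print the numbers of the signals blocked at startup.
REPORT_MASK = (
    "import signal;"
    "print(*sorted(int(s) for s in "
    "signal.pthread_sigmask(signal.SIG_BLOCK, [])))"
)


def child_mask_from_thread(block):
    """Spawn a child from a worker thread that first blocks `block`.

    Returns the set of signal numbers the child found blocked.
    """
    result = {}

    def worker():
        # Imitate a Python 2.3 thread by masking signals in this thread only.
        signal.pthread_sigmask(signal.SIG_BLOCK, block)
        out = subprocess.run(
            [sys.executable, "-c", REPORT_MASK],
            stdout=subprocess.PIPE, text=True, check=True,
        ).stdout
        result["mask"] = {int(n) for n in out.split()}

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return result["mask"]
```

Calling `child_mask_from_thread({signal.SIGUSR1})` should report SIGUSR1 as blocked in the child even though the main thread never blocked it, which is the effect that upsets a child relying on signals.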
I'm not sure what the correct behaviour should be. The fact that it works in python2.4 feels more like a byproduct of the thread mask change than correct behaviour.
Well, getting rid of the thread mask changes was one of the goals of the change.
To me it seems like execvp() should be setting the signal mask back to defaults or at least the mask of the main process before doing the exec.
Possibly. I think the 2.4 change -- not fiddling the process mask at all -- is the Right Thing, but that doesn't help 2.3 users. This has all been discussed before at some length, on python-dev and in various bug reports on SF.
In your situation, I think the simplest thing you can do is dig out an old patch of mine that exposes sigprocmask + co to Python and either make a custom Python incorporating the patch and use that, or put the code from the patch into an extension module. Then before execing fop, use the new code to set the signal mask to something sane. Not pretty, particularly, but it should work.
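The patch mentioned above is not reproduced here. As a rough modern stand-in, assuming Python 3.3 or later where the standard `signal` module already exposes the mask, "set the signal mask to something sane" around a spawn could look like the following. `signals_unblocked` is a hypothetical name.

```python
import contextlib
import signal


@contextlib.contextmanager
def signals_unblocked():
    """Temporarily clear the calling thread's signal mask."""
    # pthread_sigmask returns the previous mask so it can be restored.
    old_mask = signal.pthread_sigmask(signal.SIG_SETMASK, set())
    try:
        yield
    finally:
        # Put the thread's original mask back once the child is started.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

A thread would wrap its spawn call in `with signals_unblocked():` so that the child starts with no signals blocked. The mask is per-thread, so other threads are not affected, though the calling thread can receive signals while inside the block.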
BTW, built in file objects really could use better non-blocking support... I've got a half-drafted PEP for it... anyone interested in it?
Err, this probably should be in a different mail :)
The verboseness of the attached test code because of this issue prompted that comment... so vaguely related :-)
Oh right :) Didn't actually read the test code, not having fop to hand...
Cheers,
mwh
-- 
The ability to quote is a serviceable substitute for wit.
-- W. Somerset Maugham
On Thu, 2005-01-20 at 14:12 +0000, Michael Hudson wrote:
Donovan Baarda <abo@minkirri.apana.org.au> writes:
On Wed, 2005-01-19 at 13:37 +0000, Michael Hudson wrote:
Donovan Baarda <abo@minkirri.apana.org.au> writes: [...] The main oddness about python threads (before 2.4) is that they run with all signals masked. You could play with a C wrapper (call sigprocmask, then exec fop) to see if this is what is causing the problem. But please try 2.4.
Python 2.4 does indeed fix the problem.
That's good to hear. [...]
I still don't understand what Linux 2.4 vs Linux 2.6 had to do with it.
Reading the man pages for execve(), pthread_sigmask() and sigprocmask(), I can see some ambiguities, but mostly only if you do things they warn against (ie, use sigprocmask() instead of pthread_sigmask() in a multi-threaded app).
The man page for execve() says that the new process will inherit the "Process signal mask (see sigprocmask() )". This implies to me it will inherit the mask from the main process, not the thread's signal mask.
It looks like Linux 2.4 uses the signal mask of the main thread or process for the execve(), whereas Linux 2.6 uses the thread's signal mask.
Given that execve() replaces the whole process, including all threads, I dunno if using the thread's mask is right. Could this be a Linux 2.6 kernel bug?
I'm not sure what the correct behaviour should be. The fact that it works in python2.4 feels more like a byproduct of the thread mask change than correct behaviour.
Well, getting rid of the thread mask changes was one of the goals of the change.
I gathered that... which kinda means the fact that it fixed execvp in threads is a side effect...(though I also guess it fixed a lot of other things like this too).
To me it seems like execvp() should be setting the signal mask back to defaults or at least the mask of the main process before doing the exec.
Possibly. I think the 2.4 change -- not fiddling the process mask at all -- is the Right Thing, but that doesn't help 2.3 users. This has all been discussed before at some length, on python-dev and in various bug reports on SF.
Would a simple bug-fix for 2.3 be to have os.execvp() set the mask to something sane before executing C execvp()? Given that Python does not have any visibility of the procmask... This might be a good idea regardless as it will protect against this bug resurfacing in the future if someone decides fiddling with the mask for threads is a good idea again.
In your situation, I think the simplest thing you can do is dig out an old patch of mine that exposes sigprocmask + co to Python and either make a custom Python incorporating the patch and use that, or put the code from the patch into an extension module. Then before execing fop, use the new code to set the signal mask to something sane. Not pretty, particularly, but it should work.
The extension module that exposes sigprocmask() is probably best for now...
-- 
Donovan Baarda <abo@minkirri.apana.org.au>
http://minkirri.apana.org.au/~abo/
Donovan Baarda <abo@minkirri.apana.org.au> writes:
On Thu, 2005-01-20 at 14:12 +0000, Michael Hudson wrote:
Donovan Baarda <abo@minkirri.apana.org.au> writes:
On Wed, 2005-01-19 at 13:37 +0000, Michael Hudson wrote:
Donovan Baarda <abo@minkirri.apana.org.au> writes: [...] The main oddness about python threads (before 2.4) is that they run with all signals masked. You could play with a C wrapper (call sigprocmask, then exec fop) to see if this is what is causing the problem. But please try 2.4.
Python 2.4 does indeed fix the problem.
That's good to hear. [...]
I still don't understand what Linux 2.4 vs Linux 2.6 had to do with it.
I have to admit to not being that surprised that behaviour appears somewhat inexplicable. As you probably know, linux 2.6 has a more-or-less entirely different threads implementation (NPTL) than 2.4 (LinuxThreads) -- so changes in behaviour aren't exactly surprising. Whether they were intentional, a good thing, etc, I have a careful lack of opinion :)
Reading the man pages for execve(), pthread_sigmask() and sigprocmask(), I can see some ambiguities, but mostly only if you do things they warn against (ie, use sigprocmask() instead of pthread_sigmask() in a multi-threaded app).
Uh, I don't know how much I'd trust documentation in this situation. Really. Threads and signals are almost inherently incompatible, unfortunately.
The man page for execve() says that the new process will inherit the "Process signal mask (see sigprocmask() )". This implies to me it will inherit the mask from the main process, not the thread's signal mask.
Um. Maybe. But this is the sort of thing I meant above -- if signals are delivered to threads, not processes, what does the "Process signal mask" mean? The signal mask of the thread that executed main()? I guess you could argue that, but I don't know how much I'd bet on it.
It looks like Linux 2.4 uses the signal mask of the main thread or process for the execve(), whereas Linux 2.6 uses the thread's signal mask.
I'm not sure that this is the case -- I'm reasonably sure I saw problems caused by the signal masks before 2.6 was ever released. But I could be wrong.
Given that execve() replaces the whole process, including all threads, I dunno if using the thread's mask is right. Could this be a Linux 2.6 kernel bug?
You could ask, certainly... Although I've done a certain amount of battle with these problems, I don't know what any published standards have to say about these things, which is the only real criterion by which it could be called "a bug".
I'm not sure what the correct behaviour should be. The fact that it works in python2.4 feels more like a byproduct of the thread mask change than correct behaviour.
Well, getting rid of the thread mask changes was one of the goals of the change.
I gathered that... which kinda means the fact that it fixed execvp in threads is a side effect...(though I also guess it fixed a lot of other things like this too).
Um. I meant "getting rid of the thread mask" was one of the goals *because* it would fix the problems with execve and system() and friends.
To me it seems like execvp() should be setting the signal mask back to defaults or at least the mask of the main process before doing the exec.
Possibly. I think the 2.4 change -- not fiddling the process mask at all -- is the Right Thing, but that doesn't help 2.3 users. This has all been discussed before at some length, on python-dev and in various bug reports on SF.
Would a simple bug-fix for 2.3 be to have os.execvp() set the mask to something sane before executing C execvp()?
Perhaps. I'm not sure I want to go fiddling there. Maybe someone else does. system(3) presents a problem too, though, which is harder to worm around unless we want to implement it ourselves, in practice.
Given that Python does not have any visibility of the procmask...
This might be a good idea regardless as it will protect against this bug resurfacing in the future if someone decides fiddling with the mask for threads is a good idea again.
In the long run, everyone will use 2.4. There are some other details to the changes in 2.4 that have a slight chance of breaking programs which is why I'm uneasy about putting them in 2.3.5 -- for a bug fix release it's much much worse to break a program that was working than to fail to fix one that wasn't.
In your situation, I think the simplest thing you can do is dig out an old patch of mine that exposes sigprocmask + co to Python and either make a custom Python incorporating the patch and use that, or put the code from the patch into an extension module. Then before execing fop, use the new code to set the signal mask to something sane. Not pretty, particularly, but it should work.
The extension module that exposes sigprocmask() is probably best for now...
I hope it helps!
Cheers,
mwh
-- 
<etrepum> Jokes around here tend to get followed by implementations.
-- from Twisted.Quotes
On Thursday 20 January 2005 12:43, Donovan Baarda wrote:
On Wed, 2005-01-19 at 13:37 +0000, Michael Hudson wrote:
The main oddness about python threads (before 2.4) is that they run with all signals masked. You could play with a C wrapper (call sigprocmask, then exec fop) to see if this is what is causing the problem. But please try 2.4.
Python 2.4 does indeed fix the problem. Unfortunately we are using Zope 2.7.4, and I'm a bit wary of attempting to migrate it all from 2.3 to 2.4. Is there any way this "Fix" can be back-ported to 2.3?
It's extremely unlikely - I couldn't make myself comfortable with it when attempting to figure out its backportedness. While the current behaviour on 2.3.4 is broken in some cases, I fear very much that the new behaviour will break other (working) code - and this is something I try very hard to avoid in a bugfix release, particularly in one that's probably the final one of a series.
Fundamentally, the answer is "don't do signals+threads, you will get burned". For your application, you might want to instead try something where you write requests to a file in a spool directory, and have a python script that loops looking for requests, and generates responses. This is likely to be much simpler to debug and work with.
Anthony
-- 
Anthony Baxter <anthony@interlink.com.au>
It's never too late to have a happy childhood.
G'day,
From: "Anthony Baxter" <anthony@interlink.com.au>
On Thursday 20 January 2005 12:43, Donovan Baarda wrote:
On Wed, 2005-01-19 at 13:37 +0000, Michael Hudson wrote:
The main oddness about python threads (before 2.4) is that they run with all signals masked. You could play with a C wrapper (call sigprocmask, then exec fop) to see if this is what is causing the problem. But please try 2.4.
Python 2.4 does indeed fix the problem. Unfortunately we are using Zope 2.7.4, and I'm a bit wary of attempting to migrate it all from 2.3 to 2.4. Is there any way this "Fix" can be back-ported to 2.3?
It's extremely unlikely - I couldn't make myself comfortable with it when attempting to figure out its backportedness. While the current behaviour on 2.3.4 is broken in some cases, I fear very much that the new behaviour will break other (working) code - and this is something I try very hard to avoid in a bugfix release, particularly in one that's probably the final one of a series.
Fundamentally, the answer is "don't do signals+threads, you will get burned". For your application, you might want to instead try
In this case it turns out to be "don't do exec() in a thread, because what you exec can have all its signals masked". That turns out to be a hell of a lot of things; popen, os.command, etc. They all only work OK in a threaded application if what you are exec'ing doesn't use any signals.
something where you write requests to a file in a spool directory, and have a python script that loops looking for requests, and generates responses. This is likely to be much simpler to debug and work with.
Hmm, interprocess communications; great fun :-) And no spawning the process from within the zope application; it's gotta be a separate daemon.
Actually, I've noticed that zope often has a sorta zombie "which" process which it spawns. I wonder if this is a stuck thread waiting for some signal...
----------------------------------------------------------------
Donovan Baarda http://minkirri.apana.org.au/~abo/
----------------------------------------------------------------
On Wednesday 26 January 2005 01:01, Donovan Baarda wrote:
In this case it turns out to be "don't do exec() in a thread, because what you exec can have all its signals masked". That turns out to be a hell of a lot of things; popen, os.command, etc. They all only work OK in a threaded application if what you are exec'ing doesn't use any signals.
Yep. You just have to be aware of it. We do a bit of this at work, and we either spool via a database table, or a directory full of spool files.
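The spool-directory approach can be sketched roughly as follows. Nothing here comes from the setup described above: the `.req`/`.resp` naming, the function names and the handler signature are all assumptions chosen for illustration.

```python
import os
import tempfile


def submit_request(spool_dir, payload):
    """Write payload (bytes) into spool_dir as a new *.req file."""
    # Write under a temporary name, then rename, so the worker
    # never picks up a half-written request.
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    req_path = tmp_path[:-len(".tmp")] + ".req"
    os.rename(tmp_path, req_path)
    return req_path


def process_pending(spool_dir, handler):
    """Run handler(bytes) -> bytes on each pending request.

    Writes a matching *.resp file, removes the request, and
    returns the number of requests handled.
    """
    handled = 0
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".req"):
            continue
        req_path = os.path.join(spool_dir, name)
        with open(req_path, "rb") as f:
            response = handler(f.read())
        resp_path = req_path[:-len(".req")] + ".resp"
        with open(resp_path + ".tmp", "wb") as f:
            f.write(response)
        os.rename(resp_path + ".tmp", resp_path)
        os.remove(req_path)
        handled += 1
    return handled
```

The threaded application would only call `submit_request()` and poll for the response file, while a separate single-threaded daemon calls `process_pending()` in a loop with a short sleep. The handler is where fop would be run, outside any thread with masked signals.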
Actually, I've noticed that zope often has a sorta zombie "which" process which it spawns. I wonder if this is a stuck thread waiting for some signal...
Quite likely.
-- 
Anthony Baxter <anthony@interlink.com.au>
It's never too late to have a happy childhood.
On Wed, 2005-01-26 at 01:53 +1100, Anthony Baxter wrote:
On Wednesday 26 January 2005 01:01, Donovan Baarda wrote:
In this case it turns out to be "don't do exec() in a thread, because what you exec can have all its signals masked". That turns out to be a hell of a lot of things; popen, os.command, etc. They all only work OK in a threaded application if what you are exec'ing doesn't use any signals.
Yep. You just have to be aware of it. We do a bit of this at work, and we either spool via a database table, or a directory full of spool files.
Actually, I've noticed that zope often has a sorta zombie "which" process which it spawns. I wonder if this is a stuck thread waiting for some signal...
Quite likely.
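One way to check a suspicion like this on Linux is to look at which signals a running process has blocked. The sketch below assumes a Linux /proc filesystem and reads the `SigBlk` field of `/proc/<pid>/status`; `blocked_signals` is a made-up helper name.

```python
def blocked_signals(pid):
    """Return the set of signal numbers blocked in process pid (Linux only).

    SigBlk is a hexadecimal bitmask in which bit n-1 is set when
    signal n is blocked. The status file describes the thread whose
    id equals pid, which is normally the main thread.
    """
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("SigBlk:"):
                mask = int(line.split()[1], 16)
                break
        else:
            raise RuntimeError("no SigBlk field for pid %d" % pid)
    return {n for n in range(1, 65) if mask & (1 << (n - 1))}
```

Run against the pid of a stuck child, a large set of blocked signals would be consistent with a mask inherited from a masked thread, although it would not prove that this is why the process is stuck.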
For the record, it seems that the java version also contributes. This problem only occurs when you have the following combination:
Linux >= 2.6
Python <= 2.3
j2re1.4 = 1.4.2.01-1 | kaffe 2:1.1.4xxx
If you use Linux 2.4, it goes away. If you use Python 2.4 it goes away. If you use j2re1.4 1.4.1.01-1 it goes away.
For the problem to occur, all of the following need to hold:
1) Linux uses the thread's sigmask instead of the main thread/process sigmask for the exec'ed process (ie, 2.6 does this, 2.4 doesn't).
2) Python needs to screw with the sigmask in threads (python 2.3 does, python 2.4 doesn't).
3) The exec'ed process needs to rely on threads (j2re1.4 1.4.2.01-1 does, j2re1.4 1.4.1.01-1 doesn't).
It is hard to find old Debian deb's of j2re1.4 (1.4.1.01-1), and when you do, you will also need the now non-existent j2se-common 1.1 package. I don't know if this qualifies as a potential bug against j2re1.4 1.4.2.01-1. For now my solution is to roll back to the older j2re1.4.
-- 
Donovan Baarda <abo@minkirri.apana.org.au>
http://minkirri.apana.org.au/~abo/
participants (3): Anthony Baxter, Donovan Baarda, Michael Hudson