[Twisted-Python] Deadlocks when launching processes - how to investigate?
Hello,

I just filed https://twistedmatrix.com/trac/ticket/6972

The issue I'm facing is a deadlocked Python on OS X when a lot of processes are spawned. In the repro script we do this very aggressively to trigger the deadlock quickly, but the actual program that does this "ticks" every minute.

There is a possibility that this is either a Python bug or an OS X issue as the same program used to run fine in 10.5 and after some upgrades to 10.7 this issue appeared. We worked around it by using 2.6 but now we need 2.7.

I know this is going to be difficult for people to reproduce, so I wonder if someone can help me investigate the issue further. I found this https://dev.launchpad.net/Debugging/GDB but it doesn't work - I believe, without being able to confirm, that the issue is that GDB can't really work with clang-built executables? Or perhaps I don't have the debugging symbols.

Any help would be greatly appreciated.

Orestis
Hi Orestis, Since creating processes involves syscalls, I would expect the tool of choice to debug this on OS X to be dtruss. I wish I could help on this endeavor but right now I'm focusing on fighting e-mail. (Brace yourself. PyCon is coming.) hth lvh
On 02/14/2014 07:21 AM, Orestis Markou wrote:
I know this is going to be difficult for people to reproduce, so I wonder if someone can help me investigate the issue further. I found this https://dev.launchpad.net/Debugging/GDB but it doesn't work - I believe, without being able to confirm, that the issue is that GDB can't really work with clang-built executables? Or perhaps I don't have the debugging symbols.
When debugging Python deadlocks in general:

1. Try https://pypi.python.org/pypi/faulthandler/ - send the appropriate signal when the process deadlocks.
2. If that doesn't work, I have had good luck debugging at least one Python mystery freeze with GDB. In particular, because it has built-in Python support (sometimes), you can actually get a Python traceback. This assumes access to debugging symbols, though. lldb may have similar functionality; maybe Googling can help with that and with finding debug symbols.

In this particular case, the traceback plus some googling (http://bugs.python.org/issue11768 is what I found, presumably a different bug though) suggests the bug may be something like the signal handler not being re-entrant for some reason, and you're getting SIGCHLD right in the middle of the C code handling SIGCHLD. Try disabling SIGCHLD and just calling "twisted.internet.process.reapAllProcesses()" a few times a second and see if that's a good workaround - if so, add a note to the bug. If that is the case you may be able to reproduce the bug by setting a SIGCHLD handler and then sending SIGCHLD to the process a lot, no Twisted involved.
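For point 1, a minimal sketch of wiring up faulthandler so that a signal dumps the Python tracebacks of a stuck process. It is written against the `faulthandler` module in the Python 3 standard library and assumes the PyPI backport for 2.7 exposes the same `register` call. The helper name `install_traceback_dump` and the choice of SIGUSR1 are illustrative only.

```python
import faulthandler
import signal
import sys


def install_traceback_dump(signum=signal.SIGUSR1, stream=None):
    """Dump the traceback of every thread to `stream` when `signum` arrives.

    `stream` must be a real file (it needs a file descriptor) and must stay
    open for as long as the handler is registered.
    """
    if stream is None:
        stream = sys.stderr
    # The dump is written from the C-level signal handler, so it does not
    # rely on the interpreter getting back to running Python code.
    faulthandler.register(signum, file=stream, all_threads=True)
    return signum
```

Call `install_traceback_dump()` once at startup; when the process hangs, `kill -USR1 <pid>` from a shell should print where each thread is stuck.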
On Feb 15, 2014, at 6:38 AM, Itamar Turner-Trauring <itamar@itamarst.org> wrote:
In this particular case, the traceback plus some googling (http://bugs.python.org/issue11768 is what I found, presumably a different bug though) suggests the bug may be something like signal handler not being re-entrant for some reason and you're getting SIGCHLD just in the C code handling SIGCHLD. Try disabling SIGCHLD and just calling "twisted.internet.process.reapAllProcesses()" a few times a second and see if that's a good workaround - if so, add a note to the bug. If that is the case you may be able to reproduce the bug by setting a SIGCHLD handler and then sending SIGCHLD to the process a lot, no Twisted involved.
This was also my reading of the stack trace. Thanks for finding the reference in the Python bug tracker. The one thing that confused me was that the sample program appeared to be running the program only once a second, and waiting for it to complete before running it again. So how would the signal handler be re-entrant? Perhaps 'pmset' runs a subprocess of its own so that the parent process receives two SIGCHLDs? It looks like this fix might have been included in 2.7.6, since it was fixed on the 2.7 branch. Has it been? -glyph
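The Twisted-free reproduction Itamar suggests (install a SIGCHLD handler, then send the process a lot of SIGCHLDs) might look roughly like the sketch below. The function name `hammer_sigchld` and the default count are made up for illustration, and whether this actually triggers the deadlock on an affected OS X / Python 2.7 combination is untested.

```python
import os
import signal
import time


def hammer_sigchld(count=10000):
    """Send SIGCHLD to this process `count` times; return handler call count."""
    calls = []

    def on_sigchld(signum, frame):
        # A trivial Python-level handler; the suspected problem is in the
        # C-level code that runs before this is ever called.
        calls.append(signum)

    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        pid = os.getpid()
        for _ in range(count):
            os.kill(pid, signal.SIGCHLD)
        time.sleep(0.01)  # give a pending handler call a chance to run
    finally:
        # signal.signal() returns None if the old handler was not set from Python.
        signal.signal(signal.SIGCHLD,
                      previous if previous is not None else signal.SIG_DFL)
    return len(calls)
```

On a healthy interpreter this simply returns; signals can coalesce, so the handler may run fewer times than `count`. If the process hangs instead of returning, that would support the re-entrancy theory.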
On 15 Feb, 07:58 pm, glyph@twistedmatrix.com wrote:
The one thing that confused me was that the sample program appeared to be running the program only once a second, and waiting for it to complete before running it again.
I think it's more like 81 processes once a second and *not* waiting for them to complete before starting over. Notice the lack of yields in key places. I suspect inlineCallbacks has gradually eaten out the part of your brain that recognizes that keyword. ;) Jean-Paul
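To make that load pattern concrete, here is a standard-library sketch that starts a burst of children without waiting for any of them and then reaps them by polling, in the spirit of the "call reapAllProcesses() a few times a second" workaround suggested earlier in the thread. This is not Twisted's code; `spawn_burst`, `reap_finished` and `run_demo` are hypothetical names, and the children here exit immediately rather than running a real command.

```python
import os
import time


def spawn_burst(count):
    """Fork `count` short-lived children without waiting for any of them."""
    pids = set()
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit straight away
        pids.add(pid)
    return pids


def reap_finished(pids):
    """Poll each child once without blocking; return the pids that were reaped."""
    reaped = set()
    for pid in list(pids):
        done, _status = os.waitpid(pid, os.WNOHANG)
        if done == pid:  # waitpid returns (0, 0) while the child is still running
            reaped.add(pid)
    pids -= reaped
    return reaped


def run_demo(count=81, interval=0.05, deadline=10.0):
    """Spawn a burst, then poll until every child is reaped or time runs out."""
    pids = spawn_burst(count)
    total = 0
    end = time.time() + deadline
    while pids and time.time() < end:
        total += len(reap_finished(pids))
        time.sleep(interval)
    return total
```

No SIGCHLD handler is involved: exited children stay as zombies until the next poll picks them up, which is the price of this approach.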
On Feb 15, 2014, at 5:32 PM, exarkun@twistedmatrix.com wrote:
I think it's more like 81 processes once a second and *not* waiting for them to complete before starting over. Notice the lack of yields in key places. I suspect inlineCallbacks has gradually eaten out the part of your brain that recognizes that keyword. ;)
Well, I was skimming, and I saw the 'yield's in the things that *were* decorated with @inlineCallbacks; in this case it was the "plain" code that tricked me :-). -g
On Feb 15, 2014, at 5:32 PM, exarkun@twistedmatrix.com wrote:
I think it's more like 81 processes once a second and *not* waiting for them to complete before starting over. Notice the lack of yields in key places. I suspect inlineCallbacks has gradually eaten out the part of your brain that recognizes that keyword. ;)
I sort of noted this on the ticket, but I think the idea of using KQueue to address this would be great. Is there a similar thing we might be able to do on Linux to get rid of the dependence on a SIGCHLD handler? -g
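A rough sketch of the kqueue idea, using Python's `select.kqueue` wrapper to wait for a child's exit without any SIGCHLD handler. It only works on BSD-derived systems such as OS X; `wait_for_exit_kqueue` is a hypothetical helper, not anything Twisted provides, and it assumes the child is still alive when the event is registered.

```python
import select


def wait_for_exit_kqueue(pid, timeout=5.0):
    """Return True if process `pid` exits within `timeout` seconds."""
    if not hasattr(select, "kqueue"):
        raise NotImplementedError("kqueue is not available on this platform")
    kq = select.kqueue()
    try:
        # Ask the kernel to report when the process identified by `pid` exits.
        event = select.kevent(
            pid,
            filter=select.KQ_FILTER_PROC,
            flags=select.KQ_EV_ADD | select.KQ_EV_ONESHOT,
            fflags=select.KQ_NOTE_EXIT,
        )
        events = kq.control([event], 1, timeout)
        return any(e.fflags & select.KQ_NOTE_EXIT for e in events)
    finally:
        kq.close()
```

The caller still has to `os.waitpid()` the child afterwards to collect its status. In a reactor the kqueue descriptor would be watched alongside sockets rather than blocked on like this.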
On 18 Feb, 07:29 pm, glyph@twistedmatrix.com wrote:
I sort of noted this on the ticket, but I think the idea of using KQueue to address this would be great. Is there a similar thing we might be able to do on Linux to get rid of the dependence on a SIGCHLD handler?
Not that I know of. As far as I know Linux is missing good child process event notification support. Jean-Paul
On Feb 18, 2014, at 6:07 PM, exarkun@twistedmatrix.com wrote:
Not that I know of. As far as I know Linux is missing good child process event notification support.
Wait... what about... signalfd?! Lots of mentions of SIGCHLD here: <http://linux.die.net/man/2/signalfd> -g
On 02:50 am, glyph@twistedmatrix.com wrote:
Wait... what about... signalfd?!
Lots of mentions of SIGCHLD here: <http://linux.die.net/man/2/signalfd>
It's sort of problematic for a library. You have to block the signal for signalfd to work reliably. Now no other code that depends on SIGCHLD can run properly in that process. We could simply declare that to be the case and go ahead... But based on past experience I'm sure some people would be unhappy with that. Perhaps it could be an option only turned on by applications that know it's okay for them and that want to avoid the mess of signals? On the other hand we seem to be dealing with the mess of signals on Linux pretty okay at the moment. Who would opt in to this and why? Jean-Paul
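The trade-off can be seen without signalfd itself. The Python standard library has no signalfd wrapper as far as I know, so the sketch below swaps in `signal.pthread_sigmask` plus `signal.sigtimedwait` (Python 3.3+, not available on OS X) as a stand-in for the same "block the signal, then consume it synchronously" pattern. `consume_sigchld_while_blocked` is a hypothetical name.

```python
import os
import signal


def consume_sigchld_while_blocked(timeout=5.0):
    """Block SIGCHLD, fork a child, and pick the signal up synchronously.

    Returns (consumed, handler_calls): whether SIGCHLD was received through
    sigtimedwait, and how often the ordinary Python handler ran meanwhile.
    """
    if not hasattr(signal, "sigtimedwait"):
        raise NotImplementedError("sigtimedwait is not available on this platform")
    handler_calls = []
    previous = signal.signal(signal.SIGCHLD, lambda s, f: handler_calls.append(s))
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit straight away
        # The signal stays pending because it is blocked; take it explicitly.
        info = signal.sigtimedwait({signal.SIGCHLD}, timeout)
        os.waitpid(pid, 0)
        consumed = info is not None and info.si_signo == signal.SIGCHLD
        return consumed, len(handler_calls)
    finally:
        # Put the old handler back before unblocking the signal again.
        signal.signal(signal.SIGCHLD,
                      previous if previous is not None else signal.SIG_DFL)
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

In a single-threaded process the installed handler never runs while the signal is blocked, which is exactly the problem for any other code in the process that relies on a SIGCHLD handler.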
participants (6)
- exarkun@twistedmatrix.com
- Glyph
- Glyph Lefkowitz
- Itamar Turner-Trauring
- Laurens Van Houtven
- Orestis Markou