Debugging parallel objects?
Hi all,

I have a time series script that iterates over a bunch of simulation outputs, does some analysis, and then dumps a pickle with the results of the analysis. This script works correctly when I loop over the outputs like so:

    for pf in ts:
        do stuff

But when I use the piter functionality to loop over the outputs in parallel:

    for sto, pf in ts.piter(storage=sto):
        do the same stuff

the script will sometimes hang when I run on more than one processor. It's difficult to find exactly where and why it's hanging since the run is distributed - I'd like to be able to reproduce the error to track down why it's happening.

I'm curious if anyone has any tips for debugging parallel operations in yt. I'm not very familiar with the internals of the parallel_objects machinery, so likely places to check or put breakpoints would be very helpful. Also, are there parallel debugging tools for python?

Thanks for your help,

Nathan
Hey Nathan,

On Wed, Oct 3, 2012 at 4:48 PM, Nathan Goldbaum <nathan12343@gmail.com> wrote:
> Hi all,
>
> I have a time series script that iterates over a bunch of simulation outputs, does some analysis, and then dumps a pickle with the results of the analysis. This script works correctly when I loop over the outputs like so:
>
>     for pf in ts:
>         do stuff
>
> But when I use the piter functionality to loop over the outputs in parallel:
>
>     for sto, pf in ts.piter(storage=sto):
>         do the same stuff
>
> the script will sometimes hang when I run on more than one processor. It's difficult to find exactly where and why it's hanging since the run is distributed - I'd like to be able to reproduce the error to track down why it's happening.
My guess is that there's a problem with multi-level parallelism, or with processors not having any work.
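To picture the "processors not having any work" case, here is a minimal sketch. It assumes a plain round-robin split of outputs across ranks; the function names are hypothetical and this is not necessarily how yt's parallel_objects divides work. It only shows how some ranks can end up with fewer loop iterations than others, or with none at all.

```python
def assign_round_robin(n_items, n_ranks):
    """Return {rank: [item indices]} for a simple round-robin split.

    Illustrative assumption only; not taken from yt's source.
    """
    return {rank: list(range(rank, n_items, n_ranks))
            for rank in range(n_ranks)}


def idle_ranks(n_items, n_ranks):
    """List the ranks that receive no items under the split above."""
    assignment = assign_round_robin(n_items, n_ranks)
    return [rank for rank in sorted(assignment) if not assignment[rank]]


def iteration_counts(n_items, n_ranks):
    """Number of loop iterations each rank would execute."""
    assignment = assign_round_robin(n_items, n_ranks)
    return [len(assignment[rank]) for rank in sorted(assignment)]
```

If the loop body contains a step that every process must take part in, ranks with fewer iterations (or none) never reach it, and the remaining ranks wait indefinitely. Comparing the number of outputs with the number of processes is a cheap first check.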
> I'm curious if anyone has any tips for debugging parallel operations in yt. I'm not very familiar with the internals of the parallel_objects machinery, so likely places to check or put breakpoints would be very helpful. Also, are there parallel debugging tools for python?
Okay, there are a couple of things that help. They come with yt, and I put them in when debugging the initial parallel code a couple of years ago.

Run with --rpdb, which will spawn an XML-RPC "remote" pdb that can respond to most commands. Then kill with SIGUSR2, which I think will throw a runtime error when it gets caught by yt. This puts all the processors into the (*remarkably* dangerous) state of waiting for pdb commands. Now run:

    yt rpdb -t WHATEVER

where WHATEVER is 0 .. NPROC-1. This has to be run on the same host as the process you want to pdb. This will put you into pdb mode. You can issue "shutdown", which will shut down a single process, but it's probably easier to kill the mpirun task when you've figured it out.

For a simpler way to inspect where processes are, kill with SIGUSR1. This will cause a stack trace to print.

Good luck -- and if you can figure out how to repeat it simply, send that along too.

-Matt
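For readers unfamiliar with the SIGUSR1 approach, the general idea can be sketched with only the Python standard library. This is not yt's implementation; the names `dump_stack` and `install_stack_dump` are hypothetical, and the sketch assumes a POSIX system where SIGUSR1 exists.

```python
import os
import signal
import sys
import traceback


def dump_stack(signum, frame, out=None):
    """Write the stack of the interrupted code, tagged with the process ID."""
    out = out if out is not None else sys.stderr
    out.write("--- PID %d received signal %d ---\n" % (os.getpid(), signum))
    traceback.print_stack(frame, file=out)
    out.flush()


def install_stack_dump(sig=None):
    """Register dump_stack for SIGUSR1 (or the given signal).

    Returns False when the signal is unavailable (e.g. on Windows).
    Must be called from the main thread, as signal.signal requires.
    """
    if sig is None:
        sig = getattr(signal, "SIGUSR1", None)
    if sig is None:
        return False
    signal.signal(sig, dump_stack)
    return True
```

With the handler installed at the top of a script, running `kill -USR1 <pid>` against each hung process prints where that process currently is without stopping it. Python 3's standard `faulthandler` module offers a similar facility through `faulthandler.register`.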
> Thanks for your help,
>
> Nathan

_______________________________________________
yt-dev mailing list
yt-dev@lists.spacepope.org
http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
participants (2)
- Matthew Turk
- Nathan Goldbaum