Hi guys,
Britton, Sam and Stephen have all reported to me at different times that sometimes one of the processors in a parallel job seems to hang for a while and then race to catch up at the end. Have any of you managed to localize this problem and figure out where exactly it hangs? I think it would show up in per-processor profiling: looking at which functions take up the most time on each processor, and at the disparities in that across procs.
I'd *really* like to track this down, as it's now causing us some real problems.
Thanks!
-Matt
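For anyone who wants to try the per-processor profiling idea above, here is a minimal sketch using only the standard library's `cProfile` and `pstats`. The helper names (`get_rank`, `profile_on_this_rank`, `top_functions`) and the `profile_rankNNNN.prof` file naming are made up for illustration and are not part of yt. The sketch assumes the rank comes from `mpi4py` when it is installed and falls back to rank 0 otherwise.

```python
import cProfile
import pstats
import time


def get_rank():
    """Return the MPI rank, or 0 if mpi4py is not installed (serial fallback)."""
    try:
        from mpi4py import MPI
        return MPI.COMM_WORLD.Get_rank()
    except ImportError:
        return 0


def profile_on_this_rank(func, *args, **kwargs):
    """Run func under cProfile and dump the stats to a per-rank file.

    Returns (result of func, wall-clock seconds, path of the stats file).
    """
    rank = get_rank()
    profiler = cProfile.Profile()
    start = time.time()
    result = profiler.runcall(func, *args, **kwargs)
    elapsed = time.time() - start
    # Hypothetical naming scheme: one stats file per rank.
    stats_file = "profile_rank%04d.prof" % rank
    profiler.dump_stats(stats_file)
    return result, elapsed, stats_file


def top_functions(stats_file, n=5):
    """Return (function name, cumulative seconds) for the n costliest entries."""
    stats = pstats.Stats(stats_file)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    rows = [(key[2], value[3]) for key, value in stats.stats.items()]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:n]
```

Wrapping the suspect call (a parallel projection, say) in `profile_on_this_rank` on every rank and then comparing `top_functions` output across the per-rank files should show whether one processor spends its time somewhere the others do not.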
Yo,
Britton, Sam and Stephen have all reported to me at different times that it seems sometimes one of the processors in a parallel job hangs
I should add that the current problem I'm seeing and have been discussing with Matt appears to happen on all processors, not just one or the root. I may not have made that clear to him. But it may be related.
Stephen Skory, Graduate Student
sskory@physics.ucsd.edu
http://physics.ucsd.edu/~sskory/
I'm glad to hear you're looking into this and I'm very interested to know
what's going on. My impression from watching this in action was simply that
the head process seemed not so much to be hung up on something, but simply
going much slower than the other processes until the time they finished, at
which point it would speed up. Parallel projections were where I saw this
the most. I have a feeling the effect was more noticeable the more
processes you have.
One way I knew it was simply going slow and not actually hung up was that,
when the other processes finished, I would be left with the progress bar of
the root process, with an expected time of completion that was reading very
large. However, it would quickly wind back down to something more
reasonable as the process sped back up.
Sorry that isn't much help. Good luck with this. I'd really love to know
what's going on here.
On Thu, Sep 24, 2009 at 11:35 AM, Stephen Skory
_______________________________________________
Yt-dev mailing list
Yt-dev@lists.spacepope.org
http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
Hi Britton,
Thanks for your information. My suspicion is that there might be
something going on with the IO. I've committed a change (r1460) that
reduces IO on non-root processors -- and also warns if your loglevel
threshold is too low -- and hopefully this will help with things. In
retrospect, there really WAS a lot of IO on the processors, and the
progressbars only made it worse!
If you get a chance to test this, let me know!
-Matt
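To illustrate the kind of change described above, here is a rough sketch of quieting non-root processors with the standard `logging` module. This is not the actual r1460 change. The function names, the logger name, and the choice of `WARNING` as the floor for non-root ranks are all assumptions made for the example.

```python
import logging


def configure_rank_logging(rank, requested_level=logging.INFO,
                           logger_name="parallel_sketch"):
    """Return a logger whose verbosity depends on the MPI rank.

    The root (rank 0) honors the requested level but warns when that level
    is below INFO, since very chatty logging means more IO. Non-root ranks
    are clamped to WARNING or higher (an assumed threshold).
    """
    logger = logging.getLogger(logger_name)
    if rank == 0:
        logger.setLevel(requested_level)
        if requested_level < logging.INFO:
            logger.warning(
                "Log level %s is very verbose; expect extra IO in parallel runs",
                logging.getLevelName(requested_level),
            )
    else:
        logger.setLevel(max(requested_level, logging.WARNING))
    return logger


def show_progress(rank):
    """Only the root draws a progress bar in this sketch."""
    return rank == 0
```

With something like this, each processor would call `configure_rank_logging` once with its own rank at startup and check `show_progress` before creating any progress bar.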
On Fri, Sep 25, 2009 at 8:30 AM, Britton Smith
participants (3)
- Britton Smith
- Matthew Turk
- Stephen Skory