Hi all,

Today at a meeting, it was mentioned that perhaps yt is having trouble with parallelism. To everyone out there: how reflective is this of your experience? Is yt okay with parallelism? (Excluding projections, which I have a new engine ready to go on.)

-Matt
Yo,
Today at a meeting, it was mentioned that perhaps yt is having trouble with parallelism. To everyone out there: how reflective is this of your experience? Is yt okay with parallelism? (Excluding projections, which I have a new engine ready to go on.)
I think the biggest hurdle to parallelism is the cost of starting up the Python interpreter and loading the modules. It requires too many disk I/O operations to scale well. The only fix I know of is one of the various executable-gluing methods. Perhaps we should make a better effort to get this working on various machines, and document it? As a follow-on, I've tried this various ways on Kraken, but due to the quirks of CNL, I was unsuccessful.

Besides this, I don't think yt has trouble with parallelism. Did the comment have a context or qualification?

Stephen Skory
sskory@physics.ucsd.edu
http://physics.ucsd.edu/~sskory/
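A minimal sketch of measuring that startup cost, assuming only that yt is importable on the machine in question (the timing itself is the point, not the exact module list):

    # Time how long it takes before a parallel yt script can do any work.
    # On a shared filesystem this is dominated by the many small reads
    # needed to locate and import modules, and every MPI task pays it.
    import time

    t0 = time.time()
    import yt  # stand-in for whatever module stack the script actually loads
    t1 = time.time()

    print("importing yt took %.2f seconds" % (t1 - t0))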
Hi Matt,

It's not clear what exactly you mean. If you're asking whether we've experienced problems with regular functionality of yt in parallel, the answer is no for me. If you're asking how I feel about how parallelism is implemented in yt, that's another question with a longer answer that I won't bother writing unless that's actually what you mean.

Britton

On Thu, Aug 19, 2010 at 2:57 PM, Matthew Turk <matthewturk@gmail.com> wrote:
Hi all,
Today at a meeting, it was mentioned that perhaps yt is having trouble with parallelism. To everyone out there: how reflective is this of your experience? Is yt okay with parallelism? (Excluding projections, which I have a new engine ready to go on.)
-Matt
Hi Matt,

As you know (since we discussed it off-list), I'm the reason for this being mentioned to you. I had some pretty horrible problems with the various incarnations of HOP in yt being excruciatingly slow and consuming huge amounts of memory for a 1024^3 unigrid dataset, to the point where my grad student and I ended up just using P-GroupFinder, the standalone halo finder that comes with week-of-code enzo. Note that when I say "excruciatingly slow" and "consuming huge amounts of memory", I mean that when we used 256 nodes on Ranger, with 2 cores/node (so 512 cores total) for the 1024^3 dataset, it still ran Ranger out of memory, or, alternately, didn't finish in 24 hours. Various permutations of cores per node, total nodes, and wall clock time all resulted in either seg faults or the code running out of wall clock time, to the tune of us wasting half a million CPU hours trying to do halo-finding via yt for this dataset. That's not cool. P-GroupFinder, in comparison, generated the halo catalog for the same dataset in about 10 minutes on 256 processors. The difference in performance is striking, to say the least.

We also had serious problems with the projections taking significantly more time and memory than one might think they should based on my old standalone tools, but this is already being dealt with. Slices seemed to work just fine, and other things like PDFs seem to work fine as well.

One reason that I mentioned this to Mike Norman (presumably he is the person who mentioned the yt thing to you) is that when we were at the TeraGrid conference a couple of weeks ago, the subject of inline data analysis came up as it relates to our planned Blue Waters unigrid and AMR runs. I expressed reservations that the current version of yt would be an effective solution at the scales we need (a 4096^3 unigrid run, roughly 1024^3 refine-everywhere AMR runs), based on my recent experiences with the code. While I am on the yt-dev mailing list, you know that I'm not actively developing yt (and maybe would be considered a novice user, at best), so I could simply be 100% wrong in my concerns. Maybe we could run some performance tests? I have a 1024^3 unigrid dataset that seems to be yt's White Whale...

--Brian

On Thu, Aug 19, 2010 at 2:57 PM, Matthew Turk <matthewturk@gmail.com> wrote:
Hi all,
Today at a meeting, it was mentioned that perhaps yt is having trouble with parallelism. To everyone out there: how reflective is this of your experience? Is yt okay with parallelism? (Excluding projections, which I have a new engine ready to go on.)
-Matt
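For context, a minimal sketch of the kind of script behind the halo-finding runs Brian describes above, assuming the yt-2.x-era interface (yt.mods, load, HaloFinder, write_out), a placeholder dataset path, and the --parallel command-line switch that era of yt used to enable parallelism:

    # halo_find.py -- a sketch of a parallel HOP run, assuming the yt-2.x-era
    # interface; the dataset path and output filename are placeholders.
    # Launched under MPI as something like:
    #   mpirun -np 512 python halo_find.py --parallel
    from yt.mods import *

    pf = load("DD0100/DD0100")          # placeholder path to a 1024^3 output
    halos = HaloFinder(pf)              # runs HOP, in parallel under MPI
    halos.write_out("HopAnalysis.out")  # root task writes the halo catalog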
Brian and Matt,

FYI, I have had similar issues with a 1024^3 (unigrid) dataset. I was eventually able to get parallel HOP working, but it is a memory hog and not super fast. It required about the same number of cores and method of running on Ranger that Brian describes.

Eric

On Aug 19, 2010, at 3:39 PM, Brian O'Shea wrote:
Hi Matt,
As you know (since we discussed it off-list), I'm the reason for this being mentioned to you. I had some pretty horrible problems with the various incarnations of HOP in yt being excruciatingly slow and consuming huge amounts of memory for a 1024^3 unigrid dataset, to the point where my grad student and I ended up just using P-GroupFinder, the standalone halo finder that comes with week-of-code enzo. Note that when I say "excruciatingly slow" and "consuming huge amounts of memory", I mean that when we used 256 nodes on Ranger, with 2 cores/node (so 512 cores total) for the 1024^3 dataset, it still ran Ranger out of memory, or, alternately, didn't finish in 24 hours. Various permutations of cores per node, total nodes, and wall clock time all resulted in either seg faults or the code running out of wall clock time, to the tune of us wasting half a million CPU hours trying to do halo-finding via yt for this dataset. That's not cool. P-GroupFinder, in comparison, generated the halo catalog for the same dataset in about 10 minutes on 256 processors. The difference in performance is striking, to say the least.
We also had serious problems with the projections taking significantly more time and memory than one might think they should based on my old standalone tools, but this is already being dealt with. Slices seemed to work just fine, and other things like PDFs seem to work fine as well.
One reason that I mentioned this to Mike Norman (presumably he is the person who mentioned the yt thing to you) is that when we were at the TeraGrid conference a couple of weeks ago, the subject of inline data analysis came up as it relates to our planned Blue Waters unigrid and AMR runs. I expressed reservations that the current version of yt would be an effective solution at the scales we need (a 4096^3 unigrid run, roughly 1024^3 refine-everywhere AMR runs), based on my recent experiences with the code. While I am on the yt-dev mailing list, you know that I'm not actively developing yt (and maybe would be considered a novice user, at best), so I could simply be 100% wrong in my concerns. Maybe we could run some performance tests? I have a 1024^3 unigrid dataset that seems to be yt's White Whale...
--Brian
On Thu, Aug 19, 2010 at 2:57 PM, Matthew Turk <matthewturk@gmail.com> wrote:

Hi all,
Today at a meeting, it was mentioned that perhaps yt is having trouble with parallelism. To everyone out there: how reflective is this of your experience? Is yt okay with parallelism? (Excluding projections, which I have a new engine ready to go on.)
-Matt
Eric Hallman
Google Voice: (774) 469-0278
hallman13@gmail.com
Hi Brian & Eric,
As you know (since we discussed it off-list), I'm the reason for this being mentioned to you. I had some pretty horrible problems with the various incarnations of HOP in yt being excruciatingly slow and consuming huge amounts of memory for a 1024^3 unigrid dataset, to the point where my grad student and I ended up just using P-GroupFinder, the standalone halo finder that comes with week-of-code enzo. Note that when I say "excruciatingly slow" and "consuming huge amounts of memory", I mean that when we used 256 nodes on Ranger, with 2 cores/node (so 512 cores total) for the 1024^3 dataset, it still ran Ranger out of memory, or, alternately, didn't finish in 24 hours.
A few notes in response:

- Recently I ran a 2048^3 dataset on 264 cores that took about 2 hours and averaged about 8.5 GB per task, with a peak task of 10 GB. Your job is 1/8 the size and should have run, and I don't know why it didn't.
- If I wasn't trying to graduate I would have had more time to assist when your student (Brian) asked me for help. I'm sorry so much of your time was wasted.
- My tool as a public tool is not any good unless other people can use it too. Clearly I need to do some work on that.
- It *does* use much more memory than it needs to, you are right. I know where the problems are, and whoo-boy they are there, but they are not easy to fix.
- Speed could be better, but some of this has to do with how HOP itself works. For example, it needs to run the kD tree twice, unlike FOF, which only needs to run it once. The final group-building step is a "global" operation, so that's slow as well. On 128^3 particles, (normal) HOP takes about 75 seconds, and FOF about 25. The C HOP and FOF in yt both use the same kD tree and the same data I/O methods, so that's a fair ratio of the increased workload.

Stephen Skory
sskory@physics.ucsd.edu
http://physics.ucsd.edu/~sskory/
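For reference, a quick back-of-the-envelope using only the numbers quoted above (a sketch; it assumes the aggregate memory footprint scales roughly linearly with particle count) shows why a 1024^3 job on 512 cores should comfortably fit:

    # Rough memory scaling using only the numbers quoted above: 2048^3
    # particles on 264 tasks averaged ~8.5 GB per task; a 1024^3 run has
    # 1/8 the particles, so assume the aggregate footprint scales linearly.
    aggregate_2048 = 264 * 8.5            # ~2244 GB total for the 2048^3 run
    aggregate_1024 = aggregate_2048 / 8   # ~280 GB expected for 1024^3
    per_task_512 = aggregate_1024 / 512   # expected per-task on 512 cores

    print("expected aggregate for 1024^3: ~%.0f GB" % aggregate_1024)
    print("expected per task on 512 cores: ~%.2f GB" % per_task_512)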
Hi Stephen,
As you know (since we discussed it off-list), I'm the reason for this being mentioned to you. I had some pretty horrible problems with the various incarnations of HOP in yt being excruciatingly slow and consuming huge amounts of memory for a 1024^3 unigrid dataset, to the point where my grad student and I ended up just using P-GroupFinder, the standalone halo finder that comes with week-of-code enzo. Note that when I say "excruciatingly slow" and "consuming huge amounts of memory", I mean that when we used 256 nodes on Ranger, with 2 cores/node (so 512 cores total) for the 1024^3 dataset, it still ran Ranger out of memory, or, alternately, didn't finish in 24 hours.
A few notes in response:
- Recently I ran a 2048^3 dataset on 264 cores that took about 2 hours and averaged about 8.5 GB per task, with a peak task of 10 GB. Your job is 1/8 the size and should have run, and I don't know why it didn't.
On Ranger, Kraken, or another machine? Regardless, that's far, far less time than it took us to NOT find halos on our dataset. I'd be happy to point you towards this dataset, if you'd like (I may have already done this in an off-list email), so you can try it yourself. I'd be VERY curious to see whether you run into the same problems we did on Ranger and/or Kraken with our 1024^3 dataset.
- If I wasn't trying to graduate I would have had more time to assist when your student (Brian) asked me for help. I'm sorry so much of your time was wasted.
At this point it's more about the human time than the computer time - we spent a big chunk of the summer simply trying to find the halos in a box, which was meant to be step 1 of the project. Very frustrating for a new grad student.
- My tool as a public tool is not any good unless other people can use it too. Clearly I need to do some work on that.
- It *does* use much more memory than it needs to, you are right. I know where the problems are, and whoo-boy they are there, but they are not easy to fix.
- Speed could be better, but some of this has to do with how HOP itself works. For example, it needs to run the kD tree twice, unlike FOF, which only needs to run it once. The final group-building step is a "global" operation, so that's slow as well. On 128^3 particles, (normal) HOP takes about 75 seconds, and FOF about 25. The C HOP and FOF in yt both use the same kD tree and the same data I/O methods, so that's a fair ratio of the increased workload.
This is interesting, and puzzling. We have a 256^3 version of the simulation that I was talking about earlier, and saw numbers that would be comparable to those you mention above. Scaled up to a much larger calculation, however, it took way longer than one might think based on a back-of-the-envelope estimate. Again, I really do think that, once you finish your thesis, it'd potentially be very useful for you to take a look at our dataset. It may simply be that our very small box is pathological in some way compared to the simulations you've been testing on.

--Brian
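As a concrete version of the back-of-the-envelope Brian mentions, here is a sketch that scales Stephen's quoted 128^3 HOP timing up to 1024^3 under an assumed N log N cost model with perfect parallel scaling; the gap between this ideal and the actual 24-hour runs is the overhead under discussion:

    # Scale the quoted 128^3 HOP timing (~75 s) to 1024^3 particles, assuming
    # the work grows like N log N and parallelizes perfectly. The distance
    # between this ideal and the actual 24-hour runs is the overhead at issue.
    import math

    def scale_nlogn(t_small, n_small, n_large):
        """Scale a timing from n_small to n_large particles under N log N."""
        return t_small * (n_large * math.log(n_large)) / (n_small * math.log(n_small))

    n_small, n_large = 128**3, 1024**3
    t_serial = scale_nlogn(75.0, n_small, n_large)  # serial-equivalent seconds

    print("serial-equivalent HOP time at 1024^3: ~%.1f hours" % (t_serial / 3600.0))
    print("ideal time on 512 cores: ~%.1f minutes" % (t_serial / 512.0 / 60.0))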
participants (5)

- Brian O'Shea
- Britton Smith
- Eric Hallman
- Matthew Turk
- Stephen Skory