Hi all,

I implemented this "quadtree extension" that duplicates the quadtree on all processors, which may make it nicer to scale projections. Previously the procedure was:

1) Locally project
2) Merge across procs:
2a) Serialize quadtree
2b) Point-to-point communicate
2c) Deserialize
2d) Merge local and remote
2e) Repeat up to 2a
3) Finish

I've added a step 0), "initialize entire quadtree", which means all of step 2 becomes "perform sum of big array on all procs." This has good and bad elements: we're still doing a lot of heavy communication across processors, but it will be managed by the MPI implementation instead of by yt. Also, we avoid all of the costly serialize/deserialize procedures. So for a given dataset, step 0 will be fixed in cost, but step 1 will be reduced as the number of processors goes up. Step 2, which is now a single (or two) communication step, will increase in cost with an increasing number of processors.

So, it's not clear whether this will *actually* be helpful or not. It needs testing, and I've pushed it here:

bb://MatthewTurk/yt/ hash 3f39eb7bf468

If anybody out there could test it, I'd be mighty glad. This is the script I've been using:

http://paste.yt-project.org/show/2343/

I'd *greatly* appreciate testing results -- particularly for proc counts like 1, 2, 4, 8, 16, 32, 64, ... . On my machine, the results are somewhat inconclusive. Keep in mind you'll have to run with the option:

--config serialize=False

to get real results. Here's the shell command I used:

( for i in 1 2 3 4 5 6 7 8 9 10 ; do mpirun -np ${i} python2.7 proj.py --parallel --config serialize=False ; done ) 2>&1 | tee proj_new.log

Comparison against results from the old method would also be super helpful.

The alternate idea I'd had was a bit different, but harder to implement, and it also has a glaring problem. The idea would be to serialize arrays and do the butterfly reduction, but instead of converting into data objects, simply walk Hilbert indices progressively. Unfortunately this only works up to 2^32 effective size, which is not going to work in a lot of cases.

Anyway, if this doesn't work, I'd be eager to hear if anybody has any ideas. :)

-Matt
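To make the step-2 change concrete: once every processor holds the full tree as a dense array, the merge collapses to an element-wise sum, which is exactly what an MPI allreduce with a sum operation does (with mpi4py, `comm.Allreduce(local, merged, op=MPI.SUM)`). Here is a toy numpy sketch with simulated ranks -- no actual MPI, and all names are hypothetical:

```python
import numpy as np

n_vals = 16   # flattened size of the (hypothetical) full quadtree
n_procs = 4

rng = np.random.default_rng(0)
# Each simulated "processor" projects only its local grids, leaving
# zeros in the parts of the tree it does not own.
local_trees = []
for rank in range(n_procs):
    tree = np.zeros(n_vals)
    owned = slice(rank * 4, (rank + 1) * 4)  # disjoint ownership, for simplicity
    tree[owned] = rng.random(4)
    local_trees.append(tree)

# The old scheme serialized, communicated point-to-point, deserialized,
# and merged pairwise. With the full tree allocated everywhere, the
# merge is one global element-wise sum.
merged = np.sum(local_trees, axis=0)
```

The trade-off described above shows up directly here: every rank pays the fixed cost of allocating the full `n_vals` array (step 0), but the merge itself is a single collective operation handled by the MPI library.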
Hi Matt & friends,
I tested this on a fairly large nested simulation with about 60k grids
using 6 nodes of Janus (dual-hex nodes) and ran from 1 to 64 processors. I
got fairly good scaling and made a quick mercurial repo on bitbucket with
everything except the dataset needed to do a similar study.
https://bitbucket.org/samskillman/quad-tree-proj-performance
Raw timing:
projects/quad_proj_scale:more perf.dat
64 2.444e+01
32 4.834e+01
16 7.364e+01
8 1.125e+02
4 1.853e+02
2 3.198e+02
1 6.370e+02
A few notes:
-- I ran with 64 cores first, then again so that the disks were somewhat
warmed up, then only used the second timing of the 64 core run.
-- While I did get full nodes, the machine doesn't have a ton of I/O nodes
so in an ideal setting performance may be even better.
-- My guess would be that a lot of this speedup comes from having a
parallel filesystem, so you may not get as great of speedups on your laptop.
-- Speedup from 32 to 64 is nearly ideal...this is great.
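The speedups implied by the perf.dat numbers above are easy to check:

```python
# Wall-clock times (seconds) from perf.dat above, keyed by core count.
times = {1: 637.0, 2: 319.8, 4: 185.3, 8: 112.5, 16: 73.64, 32: 48.34, 64: 24.44}

speedup = {p: times[1] / t for p, t in times.items()}       # vs. serial run
efficiency = {p: s / p for p, s in speedup.items()}         # speedup per core

# Doubling from 32 to 64 cores cuts the time nearly in half (~1.98x),
# the "nearly ideal" step noted above; overall efficiency at 64 cores
# is about 41% relative to the 1-core run.
step_32_to_64 = times[32] / times[64]
```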
This looks pretty great to me, and I'd +1 any PR.
Sam
_______________________________________________
yt-dev mailing list
yt-dev@lists.spacepope.org
http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
Meant to include the scaling image.
Hi Sam,
Thanks a ton. This looks good to me, seeing as at low task counts we
have the overhead of creating the tree, and at high task counts we'll
have the collective operations. I'll try to get hold of another testing
machine and then I'll issue a PR. (And close Issue #348!)
-Matt
Hi all,

I just did a scaling test on Pleiades at NASA Ames and got somewhat worse scaling at high processor counts. This is with one of Matt's datasets, so that might be the issue.

Here's some summary data for the hierarchy: http://paste.yt-project.org/show/2348/

I actually found superlinear scaling going from 1 processor to 2 processors, so I made two different scaling plots. I think the second plot (assuming the 2-core run is representative of the true serial performance) is probably more accurate.

http://imgur.com/5pQ2P
http://imgur.com/pOKty

Since I was running these jobs interactively, I was able to get a pretty good feel for which parts of the calculation were most time-consuming. As the plots above show, beyond 8 processors the projection operation was so fast that increasing the processor count really didn't help much. Most of the overhead is in parsing the hierarchy, setting up the MPI communicators, and communicating and assembling the projection on all processors at the end, in rough order of importance.

This is also quite memory intensive -- each core (independent of the global number of cores) needed at least 4.5 gigabytes of memory.

Nathan Goldbaum
Graduate Student
Astronomy & Astrophysics, UCSC
goldbaum@ucolick.org
http://www.ucolick.org/~goldbaum

On May 4, 2012, at 4:20 AM, Matthew Turk wrote:
Hi everyone,

Sam and I also had a brief off-list discussion about this, and we saw
similar but not identical results to what Nathan has reported.
(Specifically, Sam sees a lot less memory overhead than Nathan does.)
Sorry for dropping this over the weekend and bringing it back up now;
I had a lot going on the last couple of days. I think the key point in
what Nathan writes is that the quadtree gets to be fast enough to be
overwhelmed by overhead. Also, it's not clear to me whether the numbers
Nathan is reporting include hierarchy instantiation or not.
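To put a number on "overwhelmed by overhead": this is the usual Amdahl's-law picture, where a fixed serial cost (hierarchy parsing, communicator setup, final assembly) caps the achievable speedup. A quick illustration -- the parallel fraction here is invented, not fitted to anyone's runs:

```python
def amdahl_speedup(p, parallel_frac):
    """Ideal speedup on p processors when only parallel_frac of the
    runtime parallelizes; the remainder stays serial."""
    return 1.0 / ((1.0 - parallel_frac) + parallel_frac / p)

# If 90% of the runtime parallelizes, going from 8 to 64 processors
# buys less than a further 2x, even with perfect communication.
s8 = amdahl_speedup(8, 0.9)    # ~4.7
s64 = amdahl_speedup(64, 0.9)  # ~8.8
```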
This leads to a couple things --
* How can overhead be reduced? This includes both the time to
generate grid objects in memory as well as the memory of those grid
objects and the affiliated hierarchy. Parsing the file off disk is a
one-time affair, so I am substantially less concerned about that.
* Where are the weak spots in the projection algorithm? At some
point, we need to do a global reduce. The mechanism that I've
identified here should feature the least amount of memory allocation,
but it could perhaps be improved in other ways.
* What are the low-hanging fruits of optimization inside the existing
algorithm?
There are a few operations that I already know have problems (2D
Profiles is a big one); these are slated to be dealt with in the 2.5
cycle.
A while back for the Blue Waters project I wrote up a little bit of
infrastructure to do profiling of time spent in various routines. I'm
going to take time to resurrect this, and add in memory management
profiling as well. Following this, I'll run this regularly to inspect
the current state of the code. What I'd really like to ask for is
help with identifying routines that may need to be inspected in
detail, and trying to figure out how we can speed them up. I'm not
interested in ceding performance territory, but unfortunately this is
an area we haven't really ever discussed openly on this list.
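For anyone who wants to start poking at this before that infrastructure lands, Python's built-in cProfile gives routine-level timings. A minimal self-contained sketch (the profiled function is just a stand-in for real yt work):

```python
import cProfile
import pstats
import io

def project_chunk(n):
    # Stand-in for per-grid projection work (hypothetical).
    return sum(i * i for i in range(n))

pr = cProfile.Profile()
pr.enable()
for _ in range(100):
    project_chunk(1000)
pr.disable()

# Sort by cumulative time and filter the report to the routine of interest.
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats("project_chunk")
report = s.getvalue()
```

Running a whole script under `python -m cProfile -o proj.prof proj.py` and loading the output with `pstats.Stats("proj.prof")` works the same way, and is an easy first pass at answering "what routines are slow?".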
So:
What routines are slow?
-Matt
John tested the quadtree projection improvements in PR #156
(https://bitbucket.org/yt_analysis/yt/pull-request/156/change-the-way-quad-tr...)
and found them to improve performance, reduce peak memory usage, and
flatten overall memory usage. He accepted the PR, so parallel
projections should be substantially improved now.
-Matt
On Mon, May 7, 2012 at 9:18 AM, Matthew Turk
Hi everyone,
Sam and I also had an off-list discussion a bit about this, and we saw similar but not identical results to what Nathan has reported. (Specifically, Sam sees a lot less memory overhead than Nathan does.) Sorry for dropping this over the weekend and bringing it back up now; I had a lot going on the last couple days. I think the key point in what Nathan writes is that the QT gets to be fast enough to be overwhelmed by overhead. Also, it's not clear to me if the numbers Nathan is reporting include hierarchy instantiation or not.
This leads to a couple things --
* How can overhead be reduced? This includes both the time to generate grid objects in memory as well as the memory of those grid objects and the affiliated hierarchy. Parsing the file off disk is a one-time affair, so I am substantially less concerned about that. * Where are the weak spots in the projection algorithm? At some point, we need to do a global reduce. The mechanism that I've identified here should feature the least amount of memory allocation, but it could perhaps be improved in other ways. * What are the low-hanging fruits of optimization inside the existing algorithm?
There are a few operations that I already know of problems in (2D Profiles is a big one), which are slated to be dealt with in the 2.5 cycle.
A while back for the Blue Waters project I wrote up a little bit of infrastructure to do profiling of time spent in various routines. I'm going to take time to resurrect this, and add in memory management profiling as well. Following this, I'll run this regularly to inspect the current state of the code. What I'd really like to ask for is help with identifying routines that may need to be inspected in detail, and trying to figure out how we can speed them up. I'm not interested in ceding performance territory, but unfortunately this is an area we haven't really ever discussed openly on this list.
So:
What routines are slow?
-Matt
On Fri, May 4, 2012 at 8:43 PM, Nathan Goldbaum
wrote: Hi all,
I just did a scaling test on Pleiades at NASA Ames and got somewhat worse scaling at high processor counts. This is with one Matt's datasets so that might be the issue.
Here's some summary data for the hierarchy: http://paste.yt-project.org/show/2348/
I actually found superlinear scaling going from 1 processor to 2 processors so I made two different scaling plots. I think the second plot (assuming the 2 core run is representative of the true serial performance) is probably more accurate.
Since I was running these jobs interactively, I was able to get a pretty good feel for which parts of the calculation were most time-consuming. As the plots above show, beyond 8 processors, the projection operation was so fast that increasing the processor count really didn't help much. Most of the overhead is in parsing the hierarchy, setting up the MPI communicators, and communicating and assembling the projection on all processors at the end, in rough order of importance.
This is also quite memory intensive - each core (independent of the global number of cores) needed at least 4.5 gigabytes of memory.
Nathan Goldbaum Graduate Student Astronomy & Astrophysics, UCSC goldbaum@ucolick.org http://www.ucolick.org/~goldbaum
On May 4, 2012, at 4:20 AM, Matthew Turk wrote:
Hi Sam,
Thanks a ton. This looks good to me, seeing as how at few tasks we have the overhead of creating the tree, and at many tasks we'll have collective operations. I'll try to get ahold of another testing machine and then I'll issue a PR. (And close Issue #348!)
-Matt
On Thu, May 3, 2012 at 6:47 PM, Sam Skillman
wrote: Meant to include the scaling image.
On Thu, May 3, 2012 at 4:44 PM, Sam Skillman
wrote: Hi Matt & friends,
I tested this on a fairly large nested simulation with about 60k grids
using 6 nodes of Janus (dual-hex nodes) and ran from 1 to 64 processors. I
got fairly good scaling and made a quick mercurial repo on bitbucket with
everything except the dataset needed to do a similar
study. https://bitbucket.org/samskillman/quad-tree-proj-performance
Raw timing:
projects/quad_proj_scale:more perf.dat
64 2.444e+01
32 4.834e+01
16 7.364e+01
8 1.125e+02
4 1.853e+02
2 3.198e+02
1 6.370e+02
A few notes:
-- I ran with 64 cores first, then again so that the disks were somewhat
warmed up, then only used the second timing of the 64 core run.
-- While I did get full nodes, the machine doesn't have a ton of I/O nodes
so in an ideal setting performance may be even better.
-- My guess would be that a lot of this speedup comes from having a
parallel filesystem, so you may not get as great of speedups on your laptop.
-- Speedup from 32 to 64 is nearly ideal...this is great.
This looks pretty great to me, and I'd +1 any PR.
Sam
On Thu, May 3, 2012 at 1:42 PM, Matthew Turk
wrote:
Hi all,
I implemented this "quadtree extension" that duplicates the quadtree
on all processors, which may make it nicer to scale projections.
Previously the procedure was:
1) Locally project
2) Merge across procs:
2a) Serialize quadtree
2b) Point-to-point communciate
2c) Deserialize
2d) Merge local and remote
2d) Repeat up to 2a
3) Finish
I've added a step 0) which is "initialize entire quadtree", which
means all of step 2 becomes "perform sum of big array on all procs."
This has good and bad elements: we're still doing a lot of heavy
communication across processors, but it will be managed by the MPI
implementation instead of by yt. Also, we avoid all of the costly
serialize/deserialize procedures. So for a given dataset, step 0 will
be fixed in cost, but step 1 will be reduced as the number of
processors goes up. Step 2, which now is a single (or two)
communication steps, will increase in cost with increasing number of
processors.
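Since every rank now holds the full, identically laid-out tree, the merge is just an elementwise sum of flat arrays, which MPI can do as one collective. A serial mimic of the arithmetic (the array contents here are made up for illustration):

```python
import numpy as np

# Serial mimic of the new step 2: every rank holds the full quadtree as
# a flat array, with zeros where it contributed nothing, and the merge
# is an elementwise sum. Array contents below are made up.
local_contributions = [
    np.array([1.0, 0.0, 0.0, 2.0]),  # rank 0's local projection
    np.array([0.0, 3.0, 0.0, 0.0]),  # rank 1's
    np.array([0.0, 0.0, 5.0, 1.0]),  # rank 2's
]
merged = np.sum(local_contributions, axis=0)
print(merged)  # [1. 3. 5. 3.]

# With mpi4py the equivalent collective would be something like
#   comm.Allreduce(MPI.IN_PLACE, flat_tree, op=MPI.SUM)
# leaving the heavy communication to the MPI implementation.
```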
So, it's not clear whether this will *actually* be helpful. It
needs testing, and I've pushed it here:
bb://MatthewTurk/yt/
hash 3f39eb7bf468
If anybody out there could test it, I'd be mighty glad. This is the
script I've been using:
http://paste.yt-project.org/show/2343/
I'd *greatly* appreciate testing results -- particularly for proc
combos like 1, 2, 4, 8, 16, 32, 64, ... . On my machine, the results
are somewhat inconclusive. Keep in mind you'll have to run with the
option:
--config serialize=False
to get real results. Here's the shell command I used:
( for i in 1 2 3 4 5 6 7 8 9 10 ; do mpirun -np ${i} python2.7 proj.py
--parallel --config serialize=False ; done ) 2>&1 | tee proj_new.log
Comparison against results from the old method would also be super
helpful.
The alternate idea that I'd had was a bit different, but harder to
implement, and also with a glaring problem. The idea would be to
serialize arrays, do the butterfly reduction, but instead of
converting into data objects simply progressively walk hilbert
indices. Unfortunately this only works for up to 2^32 effective size,
which is not going to work in a lot of cases.
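The 2^32 ceiling is just a bit budget: a quadtree position interleaves the bits of (x, y), so a 32-bit curve index holds 16 bits per axis, i.e. an effective size of 2^16 = 65536 on a side. A small sketch using Morton (Z-order) interleaving for simplicity; a Hilbert index has the same bit budget, just a different ordering:

```python
# Why a 32-bit curve index caps the effective size: two axes share the
# 32 bits, so each axis gets 16 -- a 65536-cell side at most. Morton
# interleaving shown for simplicity; Hilbert has the same bit budget.
def morton_index(x, y, bits=16):
    idx = 0
    for b in range(bits):
        idx |= ((x >> b) & 1) << (2 * b)       # x bits -> even positions
        idx |= ((y >> b) & 1) << (2 * b + 1)   # y bits -> odd positions
    return idx

max_side = 2 ** (32 // 2)
print(max_side)                   # 65536
print(morton_index(3, 1))         # 0b0111 = 7
print(morton_index(65535, 65535)) # 2**32 - 1, the last representable cell
```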
Anyway, if this doesn't work, I'd be eager to hear if anybody has any
ideas. :)
-Matt
_______________________________________________
yt-dev mailing list
yt-dev@lists.spacepope.org
http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
Hi Matt,
I wanted to let you know that I'll try this out as part of the L7 work. I've just been blocked from SSH for a few days.
--Rick
On May 4, 2012, at 4:28 AM, "Matthew Turk"
participants (4)
- Matthew Turk
- Nathan Goldbaum
- Richard P Wagner
- Sam Skillman