Hi Matt & friends,
I tested this on a fairly large nested simulation with about 60k grids,
using 6 nodes of Janus (dual hex-core nodes), running on 1 to 64
processors. I got fairly good scaling, and I made a quick Mercurial repo
on Bitbucket with everything except the dataset needed to run a similar
study:
https://bitbucket.org/samskillman/quad-tree-proj-performance
Raw timing, from perf.dat (cores, wall time in seconds):
64 2.444e+01
32 4.834e+01
16 7.364e+01
8 1.125e+02
4 1.853e+02
2 3.198e+02
1 6.370e+02
A few notes:
-- I ran the 64-core case twice so that the disks were somewhat warmed
up, and used only the second timing.
-- While I did get full nodes, the machine doesn't have many I/O nodes,
so in an ideal setting performance may be even better.
-- My guess is that a lot of this speedup comes from having a parallel
filesystem, so you may not see speedups this good on a laptop.
-- Speedup from 32 to 64 cores is nearly ideal (48.3 s / 24.4 s ~= 1.98x;
see the quick check below)...this is great.
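For reference, here's the arithmetic on the whole table (a quick
sanity-check script, nothing yt-specific; the timings are just the
perf.dat numbers above):

# Speedup and parallel efficiency from the perf.dat timings above.
timings = {1: 637.0, 2: 319.8, 4: 185.3, 8: 112.5,
           16: 73.64, 32: 48.34, 64: 24.44}

t1 = timings[1]
for cores in sorted(timings):
    speedup = t1 / timings[cores]
    print("%3d cores: speedup %6.2fx, efficiency %5.1f%%"
          % (cores, speedup, 100.0 * speedup / cores))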
This looks pretty great to me, and I'd +1 any PR.
Sam
On Thu, May 3, 2012 at 1:42 PM, Matthew Turk wrote:
Hi all,
I implemented this "quadtree extension" that duplicates the quadtree on all processors, which may make projections scale more nicely. Previously the procedure was:
1) Locally project
2) Merge across procs:
   2a) Serialize quadtree
   2b) Point-to-point communicate
   2c) Deserialize
   2d) Merge local and remote
   2e) Repeat from 2a
3) Finish
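For anyone who hasn't dug into that code, the merge loop has roughly this shape (a sketch only -- serialize_quadtree, deserialize_quadtree, and merge_quadtrees are stand-ins for the real yt routines, not actual API):

from mpi4py import MPI

comm = MPI.COMM_WORLD

def old_style_merge(local_tree):
    # Pairwise tree reduction: each round, half the remaining ranks
    # serialize their tree and send it point-to-point; the partner
    # deserializes and merges.  Repeats until rank 0 holds everything.
    step = 1
    while step < comm.size:
        if (comm.rank // step) % 2 == 1:
            comm.send(serialize_quadtree(local_tree),
                      dest=comm.rank - step, tag=step)
            break  # this rank has handed off its data
        elif comm.rank + step < comm.size:
            remote = deserialize_quadtree(
                comm.recv(source=comm.rank + step, tag=step))
            local_tree = merge_quadtrees(local_tree, remote)
        step *= 2
    return local_tree  # complete only on rank 0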
I've added a step 0), "initialize entire quadtree," which means all of step 2 becomes "perform a sum of a big array on all procs." This has good and bad elements: we're still doing a lot of heavy communication across processors, but it will be managed by the MPI implementation instead of by yt, and we avoid all of the costly serialize/deserialize steps. So for a given dataset, step 0 is fixed in cost, step 1 shrinks as the number of processors goes up, and step 2, which is now a single (or two) collective communication steps, grows in cost with the number of processors.
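In mpi4py terms, the new step 2 boils down to something like this (a minimal sketch, assuming the tree's values live in a flat float64 array and that every rank built the identical full tree in step 0; n_cells is a placeholder for whatever the full tree size works out to be):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Step 0: every rank allocates the *entire* quadtree up front, so the
# value buffer has an identical layout on all processors.
values = np.zeros(n_cells, dtype="float64")  # n_cells: placeholder

# Step 1: each rank projects its local grids into its copy of the
# buffer; cells it doesn't own simply stay zero.
# ... local projection fills `values` here ...

# Step 2: one collective sum replaces the whole serialize /
# point-to-point / deserialize / merge loop.
comm.Allreduce(MPI.IN_PLACE, values, op=MPI.SUM)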
So it's not clear whether this will *actually* be helpful. It needs testing, and I've pushed it here:
bb://MatthewTurk/yt/ hash 3f39eb7bf468
If anybody out there could test it, I'd be mighty glad. This is the script I've been using:
http://paste.yt-project.org/show/2343/
I'd *greatly* appreciate testing results -- particularly for processor counts like 1, 2, 4, 8, 16, 32, 64, ... . On my machine the results are somewhat inconclusive. Keep in mind you'll have to run with the option:
--config serialize=False
to get real results. Here's the shell command I used:
( for i in 1 2 3 4 5 6 7 8 9 10 ; do mpirun -np ${i} python2.7 proj.py --parallel --config serialize=False ; done ) 2>&1 | tee proj_new.log
Comparison against results from the old method would also be super helpful.
The alternate idea I'd had was a bit different, harder to implement, and has a glaring problem. The idea was to serialize arrays and do the butterfly reduction, but instead of converting into data objects, simply walk Hilbert indices progressively. Unfortunately this only works up to an effective size of 2^32, which is not going to be enough in a lot of cases.
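To put a number on that limit: a Hilbert key at refinement level L in d dimensions needs d*L bits, so (back-of-the-envelope, not tied to any particular implementation):

# Bits in the key bound the deepest level a Hilbert walk can address.
for bits in (32, 64):
    for d in (2, 3):
        max_level = bits // d
        print("%d-bit key, %dD: max level %d (2^%d cells per axis)"
              % (bits, d, max_level, max_level))

With a 32-bit key and a 2D projection, that's 2^16 cells per axis -- 2^32 effective cells total -- which deeply nested runs blow right through.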
Anyway, if this doesn't work, I'd be eager to hear if anybody has any ideas. :)
-Matt