I prepared some strong scaling plots today (and I'm actually quite pleased with them -- I will share them next week) and discovered a couple items of interest.
* Profile1D is substantially slower than Profile2D. I'm currently investigating why, but my working belief is that the 1D histogramming is done in pure Python, which slows us down relative to the hand-coded C routine I wrote for the 2D and 3D histograms (BinDProfile in yt/utilities/data_point_utilities.c). I'll probably rewrite the 1D binning in C at some point.
* Projections (old-style, not quadtree) scale much better than I had realized, up to ~64 processors. At 128 processors on the dataset I tested (512^3 L7), the algorithmic overhead and processor starvation combined to reduce scaling substantially. The good news is that it still takes only 40 seconds on 64 processors. This was on Triton, with an 8x8 node topology. I'm confident that on bigger datasets it would scale further.
* With the coalescing of grid/cpu reading in the yt-Enzo code, IO is not really an issue at the moment.
* DerivedQuantities use _mpi_catlist, which I think I wrote something like two and a half years ago. At that time I avoided any collective communications or non-blocking combines, so it proceeded by pickling lists that got passed over the wire (one by one) to the root processor, where they were joined and then broadcast back. This is extremely slow. The mechanism that I now think makes the most sense is to use the _recv_arrays routine, which manages an alltoallv call. (A rough sketch of what the old join amounts to follows this list.)
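
To make that last point concrete, here is a rough mpi4py-flavored sketch of what the old root-relay join amounts to. The function name and all the details are illustrative; this is not the actual _mpi_catlist code:

    from mpi4py import MPI

    def cat_lists_root_relay(local, comm=MPI.COMM_WORLD):
        """Sketch of a root-relay list join: each rank's list is pickled
        and sent to the root one at a time, joined there, and the joined
        list is broadcast back out to everyone."""
        rank, size = comm.Get_rank(), comm.Get_size()
        if rank == 0:
            joined = list(local)
            for source in range(1, size):
                # One blocking, pickled receive per rank: the root sits
                # in a serial loop while every other rank waits.
                joined.extend(comm.recv(source=source))
        else:
            comm.send(list(local), dest=0)
            joined = None
        # A second full pass over the wire to return the result everywhere.
        return comm.bcast(joined, root=0)

Every rank's data crosses the wire twice (once to the root, once back out in the broadcast), and the root receives serially, so the cost grows with processor count.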
I've prepared two pastes. The first is a patch:
that converts the Derived Quantity parallel join to _mpi_catarrays, and converts _mpi_catarrays itself to use the alltoallv wrapper.
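
For a mental model of what the converted join looks like, here is a minimal sketch of concatenating every rank's array onto every rank with a single alltoallv. Again, the names are illustrative and this is not the patch itself:

    import numpy as np
    from mpi4py import MPI

    def cat_arrays_alltoallv(local, comm=MPI.COMM_WORLD):
        """Concatenate every rank's 1D float64 array onto every rank
        in one collective call."""
        local = np.ascontiguousarray(local, dtype="float64")
        size = comm.Get_size()
        # Every rank learns every other rank's element count.
        counts = comm.allgather(local.size)
        # Send side: the same local buffer goes to every destination,
        # so all send displacements are zero and all counts are equal.
        scounts = [local.size] * size
        sdispls = [0] * size
        # Receive side: rank i's data lands at offset sum(counts[:i]).
        rdispls = [sum(counts[:i]) for i in range(size)]
        recv = np.empty(sum(counts), dtype="float64")
        comm.Alltoallv([local, scounts, sdispls, MPI.DOUBLE],
                       [recv, counts, rdispls, MPI.DOUBLE])
        return recv

Each rank ends up with the same concatenated array, in rank order, in one collective operation rather than a serialized relay through the root.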
The second is the test script I used:
which I was only able to test on my laptop this evening, as Triton's disk is down for a bit. It produced bitwise identical results (once I handled the transposes correctly!) between the parallel and serial runs.
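
The comparison itself is nothing fancier than exact equality on the saved outputs; schematically (the filenames here are placeholders for however you dump your results):

    import numpy as np

    # Arrays written by the serial and parallel runs, respectively.
    serial = np.load("serial_results.npy")
    parallel = np.load("parallel_results.npy")

    assert serial.shape == parallel.shape
    # np.array_equal demands exact elementwise equality -- the
    # "bitwise identical" standard, not a tolerance-based np.allclose.
    assert np.array_equal(serial, parallel), "parallel and serial differ!"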
If anyone who uses parallelism in yt heavily has a chance, I'd really appreciate it if you could run this script on a dataset you've got. If you've got other big parallel jobs that use the DerivedQuantities or Slice mechanisms (as those are the ones that have been touched), I'd also be glad to hear how those go.