Hi John,
yt was using slightly more memory on ranger at 2.1GB/core, which isn't bad at all. This pushed me over the 2GB/core limit on ranger, so I had to use 8 cores/node instead of 16.
Oh, hm, interesting...
However, it was slower by a factor of 2.5. It took 1084 seconds from start to finish (including all of the overhead). I had already created a binary hierarchy beforehand. Ranger in general is slow (I suspect its interconnect), so maybe it's just a "feature" of ranger.
Okay. So that is something that I wonder if we can improve -- particularly since you're already seeing that you need more cores to run anyway. Right now, the mechanism for reading data goes something like this:

    Projection:
      For each level:
        identify grids on this level
        read all grids:
          for file in all_files_for_these_grids:
            H5Fopen(File)
            for each grid in this file:
              H5Dread(each data set for this grid)

So for each file that appears on a given level, the corresponding CPU file is only H5Fopen'd once -- which, with large lustre systems, should help out. (However, it does do multiple, potentially very small, H5Dreads -- but I think we might be able to coalesce these the same way enzo (optionally) can, with the H5P_DATASET_XFER property type, since we're reading into void*'s that should exist through the entirety of the C function.)

One other option would be to allow the projections to preload the entire dataset, rather than just the files needed for that level. If we assume complete grid locality, then our level-by-level approach *could* have roughly (N_enzo_cpus)/(N_yt_cpus) * N_levels H5Fopens, but it could be a lot worse with the standard enzo load balancing. The projections parallelize by 2D domain decomp, and they define their regions right away. So if we were to preload the entire dataset, rather than going level-by-level, we'd use more memory but we'd have fewer H5Fopen calls (which, again, I'm told are the most expensive part of lustre data access.)

I've created a patch that handles both of these things, and it's in the hierarchy-opt branch as hash 801b378a22f7. Because this touches the C code, this requires a new install or develop (depending on how you installed the first time.) If you get a chance, could you let me know if this improves things? I think maybe modifying the exact buffer size inside yt/lagos/HDF5LightReader.c may adjust things as well. (I was unable to test if this worked any better, as triton was down...)
You can change the mechanism for preloading by setting the argument preload_style to either "level" (currently the default) or "all" (where it loads the entire source that it "owns"). This can be passed through the call to add_projection:

    pc.add_projection("Density", 0, preload_style='all')
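
To make the trade-off between the two preload styles concrete, here is a minimal sketch that just counts how many file opens one reader would issue under each style. This is not the code in yt or in the patch: the function count_file_opens, the (level, filename) representation, and the file names are all made up for illustration, and the example assumes the worst-ish case where every level has grids in every CPU file.

```python
from collections import defaultdict


def count_file_opens(grids, preload_style="level"):
    """Count the H5Fopen-style calls one reader would make.

    grids: list of (level, filename) pairs, one per grid this reader owns.
    This is a hypothetical bookkeeping model, not yt's actual reader.
    """
    if preload_style == "all":
        # Preload everything up front: each distinct file is opened once.
        return len({filename for _, filename in grids})
    if preload_style == "level":
        # Level-by-level: each distinct file is opened once per level
        # on which it holds at least one grid.
        files_by_level = defaultdict(set)
        for level, filename in grids:
            files_by_level[level].add(filename)
        return sum(len(files) for files in files_by_level.values())
    raise ValueError("preload_style must be 'level' or 'all'")


# Hypothetical layout: 3 levels, each with one grid in each of 4 CPU files.
grids = [(level, "data.cpu%04d" % cpu) for level in range(3) for cpu in range(4)]

opens_by_level = count_file_opens(grids, "level")
opens_all = count_file_opens(grids, "all")
print(opens_by_level, opens_all)
```

In this toy layout the level-by-level style reopens each file once per level, so the open count scales with the number of levels, while "all" pays for each file only once at the cost of holding everything in memory.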
Somewhat related but -- The Alltoallv call was failing when I compiled mpi4py with openmpi, but this went away when I compiled it with mvapich. If
Dang it. This looks like I'm just passing around arrays that are too big. I think for this I might need some help from other people about the right way to do this... Ideas, anybody?

-Matt