
Hi YT devs,

I was asked by the sys admin at NICS about a part of YT that I'm not very familiar with: "Is the particle IO in YT that calls h5py spawned by multiple processors, or is it done serially?" We're trying to track down a memory issue we're seeing in parallelHF. Stephen has been making modifications that I'm currently trying out, but we wanted to see if we could identify the source of the problem. I told the admin that parallel projection works with each processor reading in only a portion of the field variables, but I couldn't be sure about the particle data, so I wanted to verify with people on the dev list.

From G.S.

Geoffrey,
"Is the particle IO in YT that calls h5py spawned by multiple processors or is it doing it serially?"
For your purposes, h5py is only used to *write* particle data to disk after the halos have been found (if you are saving them to disk, which you must do explicitly, of course). And in this case, it will open up one file using h5py per MPI task.

I'm guessing that they're actually concerned about reading particle data, because that is more disk intensive. This is done with functions written in C that read the data, not h5py. Here each MPI task does its own reading of data, and may open up multiple files to retrieve the particle data it needs, depending on the layout of grids in the .cpuNNNN files.

Does that help?

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
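To make the write path concrete, here is a minimal sketch of one-file-per-MPI-task output with h5py and mpi4py. The file name pattern and the dataset written are illustrative assumptions only, not yt's actual halo output format:

    # Minimal sketch of "one h5py file per MPI task" output, as described above.
    # The filename pattern and dataset name are illustrative assumptions; yt's
    # real halo writer differs in detail.
    import numpy as np
    import h5py
    from mpi4py import MPI

    rank = MPI.COMM_WORLD.Get_rank()

    # Pretend this task has already found halos and holds their particle data.
    particle_mass = np.random.random(1000)

    # Each MPI task opens and writes its own file; there is no collective
    # or shared h5py file here.
    f = h5py.File("halo_particles_%04d.h5" % rank, "w")
    f.create_dataset("particle_mass", data=particle_mass)
    f.close()

Run under MPI (e.g. mpirun -np 4 python write_halos.py) this produces one HDF5 file per rank, which is the behavior described above.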

Ah yes, I think that answers our question. We were worried that all the particles were being read in by each processor (which I told him I didn't think was the case, or it would have crashed my smaller 800 cube long ago), but I wanted to get the answer from the pros. Thanks!

From G.S.

Geoffrey,

Parallel HOP definitely does not attempt to load all of the particles simultaneously on all processors. This is covered in the method papers for both p-hop and yt, the documentation for yt, the source code, and I believe on the yt-users mailing list a couple of times when discussing estimates for resource usage in p-hop.

The struggles you have been having with Nautilus may in fact be a yt problem, an application-of-yt problem, a software problem on Nautilus, or even (if Nautilus is being exposed to an excessive number of cosmic rays, for instance) a hardware problem. To properly debug exactly what is going on, it would probably be productive for you to provide us with the following:

1) What are you attempting to do, precisely?
2) What type of data, and what size of data, are you applying this to?
3) What version of yt are you using (changeset hash)?
4) How are you launching yt?
5) What memory is available to each individual process?
6) Under what circumstances does yt crash?
7) How does yt report this crash to you, and is it deterministic?
8) What have you attempted? How did it change #6 and #7?

We're interested in ensuring that yt functions well on Nautilus, and that it is able to successfully halo find, analyze, etc. However, right now it feels like we're being given about 10% of a bug report, and that is regrettably not enough to properly diagnose and repair the problem.

Thanks,
Matt

Sorry for the fragmented pieces of info; I was trying to determine what the problem is with one of the sys admins at Nautilus, so I'm not even sure yet whether it is YT's problem.

Symptoms: parallelHF fails for the 3200 cube dataset, but not always at the same place, which leads us to think this might be a memory issue.

1) What are you attempting to do, precisely?

Currently I'm trying to run parallelHF on pieces of the subvolume, since I've found that the memory requirement of the whole dataset exceeds the machine's available memory (Nautilus, with 4TB of shared memory).

2) What type of data, and what size of data, are you applying this to?

I'm running parallelHF on DM-only data, on a piece of the subvolume that's 1/64th of the original volume.

3) What is the version of yt you are using (changeset hash)?

I was using the latest YT as of last week when I ran the unsuccessful runs; currently I'm trying Stephen's modification, which should help with memory:

    (dev-yt)Geoffreys-MacBook-Air:yt-hg gso$ hg identify
    2efcec06484e (yt) tip

I am going to modify my script and send it to the sys admin to run a test on the 800 cube first. I've been asked not to submit jobs on the 3200, because the last time I did it brought half the machine to a standstill.

4) How are you launching yt?

I was launching it with 512 cores and 2TB of total memory, but they said to try decreasing the MPI task count, so I've also tried 256, 64, and 32. They all failed after a while; a couple were doing fine during the parallelHF phase but suddenly ended with:

    MPI: MPI_COMM_WORLD rank 6 has terminated without calling MPI_Finalize()
    MPI: aborting job
    MPI: Received signal 9

5) What is the memory available to each individual process?

I've usually launched the 3200 with 2TB of memory, with MPI task counts varying from 32 to 512.

6) Under what circumstances does yt crash?

In addition to the MPI errors above, I've had:

    P100 yt : [INFO ] 2011-10-03 08:03:06,125 Getting field particle_position_x from 112
    MPI: MPI_COMM_WORLD rank 153 has terminated without calling MPI_Finalize()
    MPI: aborting job
    MPI: Received signal 9
    asallocash failed: system error trying to write a message header - Broken pipe

and, with the same script,

    P180 yt : [INFO ] 2011-10-03 15:12:01,898 Finished with binary hierarchy reading
    Traceback (most recent call last):
      File "regionPHOP.py", line 23, in <module>
        sv = pf.h.region([i * delta[0] + delta[0] / 2.0,
      File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/data_objects/static_output.py", line 169, in hierarchy
        self, data_style=self.data_style)
      File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/frontends/enzo/data_structures.py", line 162, in __init__
        AMRHierarchy.__init__(self, pf, data_style)
      File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/data_objects/hierarchy.py", line 79, in __init__
        self._detect_fields()
      File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/frontends/enzo/data_structures.py", line 405, in _detect_fields
        self.save_data(list(field_list),"/","DataFields",passthrough=True)
      File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/utilities/parallel_tools/parallel_analysis_interface.py", line 216, in in_order
        f1(*args, **kwargs)
      File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/data_objects/hierarchy.py", line 222, in _save_data
        arr = myGroup.create_dataset(name,data=array)
      File "/nics/b/home/gsiisg/NautilusYT/lib/python2.7/site-packages/h5py-1.3.1-py2.7-linux-x86_64.egg/h5py/highlevel.py", line 464, in create_dataset
        return Dataset(self, name, *args, **kwds)
      File "/nics/b/home/gsiisg/NautilusYT/lib/python2.7/site-packages/h5py-1.3.1-py2.7-linux-x86_64.egg/h5py/highlevel.py", line 1092, in __init__
        space_id = h5s.create_simple(shape, maxshape)
      File "h5s.pyx", line 103, in h5py.h5s.create_simple (h5py/h5s.c:952)
    h5py._stub.ValueError: Zero sized dimension for non-unlimited dimension (Invalid arguments to routine: Bad value)

7) How does yt report this crash to you, and is it deterministic?

Many times there isn't any associated error output in the logs; the process just hangs and becomes non-responsive. The admin has tried it a couple of times and has seen different errors on two different datasets, so right now it could also be that the dataset is corrupted. So far it is not deterministic.

8) What have you attempted? How did it change #6 and #7?

I've tried:

- adding the environment variables

    export MPI_BUFS_PER_PROC=64
    export MPI_BUFS_PER_HOST=256

  with no change in behavior, sometimes still resulting in the MPI_Finalize() error;

- using my own installation of OpenMPI, which fails at import:

    from yt.mods import *
      File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/mods.py", line 44, in <module>
        from yt.data_objects.api import \
      File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/data_objects/api.py", line 34, in <module>
        from hierarchy import \
      File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/data_objects/hierarchy.py", line 40, in <module>
        from yt.utilities.parallel_tools.parallel_analysis_interface import \
      File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/utilities/parallel_tools/parallel_analysis_interface.py", line 49, in <module>
        from mpi4py import MPI
    ImportError: /nics/b/home/gsiisg/NautilusYT/lib/python2.7/site-packages/mpi4py/MPI.so: undefined symbol: mpi_sgi_inplace

  The system admin says there are bugs or incompatibilities with the network and that I should use SGI's MPI via the module mpt/2.04, which I was using before trying my own installation of OpenMPI;

- currently modifying my script with Stephen's proposed changes; once it runs on my laptop I will let the sys admin try it on the small 800 cube dataset before trying it on the 3200. At least when his job hangs the machine he can terminate it faster, without waiting for someone to answer his emails. Hopefully these tests won't be too much of a disruption to other Nautilus users;

- I also spoke briefly with Brian Crosby during the enzo meeting about this; he said he has encountered MPI errors on Nautilus as well, but his issue might be different from mine.

This may or may not be a YT issue after all, but since it seems like multiple people are interested in YT's performance on Nautilus, I'll keep everyone updated with the latest developments.

From G.S.
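For context, here is a rough sketch of the kind of subvolume loop described above. It is a hypothetical reconstruction, not the actual regionPHOP.py; the dataset name, the 4x4x4 split, and the parallelHF keywords (particularly subvolume) are assumptions based on this thread and the yt 2.x halo finder interface, so check the signatures in your yt version:

    # Hypothetical sketch of running parallel HOP over subvolumes of the domain;
    # not the actual regionPHOP.py script referenced in the traceback above.
    import numpy as na
    from yt.mods import *

    pf = load("RD0042/RedshiftOutput0042")   # placeholder dataset name

    n = 4  # split each axis into 4 pieces -> 64 subvolumes (1/64th each)
    delta = (pf.domain_right_edge - pf.domain_left_edge) / n

    for i in range(n):
        for j in range(n):
            for k in range(n):
                left = pf.domain_left_edge + delta * na.array([i, j, k])
                right = left + delta
                center = (left + right) / 2.0
                sv = pf.h.region(center, left, right)
                # The subvolume keyword is assumed from the yt 2.x halo finder
                # docs; verify it against the parallelHF signature you have.
                halos = parallelHF(pf, subvolume=sv, threshold=160.0)
                halos.write_out("halos_%d_%d_%d.out" % (i, j, k))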

Hi Geoffrey,

Thank you *very* much for your detailed response!

All of this sounds like memory errors. I don't think it's a problem with Nautilus (although I personally experienced problems with the old GPFS filesystem on Nautilus, long ago).

I have a few followup questions for Stephen:

* Does parallel HOP still dynamically load balance? To do so, does it conduct histograms across datasets (i.e., similar to how we subselect the particles for a region by striding over them), or does it load, evaluate, and discard?
* What multiple of the total dataset memory size is necessary to p-HOP an ideally load-balanced set of particles?
* Are there any points in the code where the root processor is used as a primary staging location, or where the arrays are duplicated in some large amount on the root processor?
* Are there any points where fields are duplicated? What about fancy-indexing, or implicit copies?

Do you think it is reasonable, on a large system, to halo find a dataset of this size? Is it feasible to construct resource estimates for ideally-balanced datasets?

Thanks for any ideas,
Matt

Hi Stephen,

I spent a bit of time looking into pHOP's memory usage, and I think there are some obvious places it could be improved. Much of this is due to your usage of the halo objects and structures I wrote a long time ago; I wonder if, in a version 2.0 of pHOP, these could be jettisoned (they were not designed to be anything other than "something that worked", and they manifestly no longer do) and a new system that reflects our new understanding could be used.

The places where I see memory jump that I think can be avoided:

* Copying fields, without removing them, from self._data_source in __obtain_particles
* Initializing ParallelHOPHaloFinder with copies (made by dividing) of the position and mass fields
* Copying position and mass into the fKD object without removing them from the halo finder (is there any reason they can't simply be moved in, removed from the halo finder, and then copied back out?)
* Rearrange = True is the default, which I believe copies the entire kD-tree inside fKD?

It also looks like some of this is because the fKD tree requires the position array to be (N,3) for memory access speed.

I've been running this on a test 256^3 dataset (recall 256^3 * 64 bits * [posx, posy, posz, mass, index] = 0.625 GB), inserting calls to get_memory_usage() at various stages. Initially the peak memory usage of pHOP was about 4.5 GB. By inserting deletions of the _data_source fields and removing the division step, I was able to reduce the peak memory usage *before* construction of the kD-tree to 1.4 GB. Afterward, it went up to 2.8 GB. By changing rearrange = False, the peak instead went up to only 2.5 GB.

I wasn't able to reduce memory usage by deleting (even as an intermediate step) self.xpos, self.ypos, self.zpos, which then led me to change how ParallelHOPHaloFinder is initialized, by instead passing in the actual particle_fields dictionary. During the __init__ of ParallelHOPHaloFinder I then pop'ed these fields out when setting self.xpos etc. This reduced peak memory down to ~1.5 GB before density was calculated, at which point it hits 1.8 GB. Unfortunately, it's not entirely clear to me how we then copy back out of the Forthon kD-tree at the right time; I don't quite know the inner workings of pHOP. But I think this is valid, as I don't believe information is lost at any point.

I think a combination of reducing memory copies and reducing reliance on my old, unnecessary object classes might be able to dramatically improve pHOP. Can we take a look together? My (currently broken!) patch is here, if it helps provide a starting point:

http://paste.yt-project.org/show/1875/

Thanks for any ideas,
Matt
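To make the arithmetic and the "pop the field instead of copying it" idea above concrete, here is a small illustrative sketch in plain numpy. The names (particle_fields, xpos, and so on) mirror the discussion but are hypothetical stand-ins, not the actual pHOP attributes; the instrumentation described above uses yt's get_memory_usage() helper, which is not shown here.

    # Illustrative only: the baseline memory arithmetic quoted above, plus the
    # "move, don't copy" pattern for handing fields around.
    import numpy as np

    # Baseline: 256^3 particles, five 64-bit fields.
    n_particles = 256**3
    fields = ["posx", "posy", "posz", "mass", "index"]
    print(n_particles * 8 * len(fields) / 1024.0**3)   # ~0.625 GB

    # Simulated per-task particle fields (kept small so the demo is cheap).
    n_demo = 10**6
    particle_fields = dict((f, np.random.random(n_demo)) for f in fields)

    # Copying keeps two live arrays per field:
    #   xpos = particle_fields["posx"].copy()
    # Popping moves the reference instead, so only one array stays alive:
    xpos = particle_fields.pop("posx")
    ypos = particle_fields.pop("posy")
    zpos = particle_fields.pop("posz")

The same accounting explains why the unmodified path peaks at several times the 0.625 GB baseline: each copy of a field adds another full-size array to the peak.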

Hi Matt,

I've gone through what you started and fixed the bugs. The diff below produces identical results to the old way. Could you run it on the dataset you used and see if I haven't ballooned the memory? I was using a 40^3 dataset (it's fast!), but it's hard to see big changes in memory with that. If we like this, I'll remove all the prints, apply it to my yt fork, and ask Geoffrey to try it out. If he sees no big problems, I'll do a pull request.

http://paste.enzotools.org/show/1876/
* Initializing ParallelHOPHaloFinder with copies (by dividing) of position and mass fields
I replaced those with na.multiply and na.divide, which should do the operations in place on the arrays (there is a short sketch of this pattern after this message).
* Copying position, mass into the fKD object without removing them from the halo finder (is there any reason they can't simply be moved in and removed from the halo finder, then copied back out?)
No, it should be possible, and I did that in the new diff. I don't know if it's helped.
* Rearrange = True is the default, which I believe copies the entire kD-tree inside fKD?
That's correct, it rearranges the position data so that the nearest neighbor searches are faster, and makes a copy to do this.
It also looks like some of this is because the fKD tree requires position to be (N,3) for memory access speed.
Yes, which is why I want a C/C++ kd-tree someday that is as fast as this Fortran one, or nearly as fast. I have yet to find one.
Unfortunately, it's not entirely clear to me how we then copy back out of the Forthon kD-tree at the right time; I don't quite know the inner workings of pHOP. But I think this is valid, as I don't believe information is lost at any point.
I think I've found the right place, just before the fKD object is deleted and the tree cleared.

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
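A small sketch of the two memory-saving patterns discussed above, using plain numpy and a stand-in object in place of the Forthon fKD module. The fKD attribute names here are assumptions for illustration only:

    # Illustrative sketch; "FakeKD" is a stand-in, not the real Forthon kd-tree,
    # and the attribute names are assumed for the example.
    import numpy as np

    pos = np.random.random((10**6, 3))
    mass = np.random.random(10**6)

    # In-place scaling: np.divide/np.multiply with an explicit output array
    # avoid allocating a second copy, unlike "pos = pos / norm".
    norm = 3200.0
    np.divide(pos, norm, pos)
    np.multiply(mass, 1.0e10, mass)

    class FakeKD(object):
        """Stand-in for the Forthon fKD object used by pHOP."""
        pass

    fKD = FakeKD()

    # Move, don't copy: hand the arrays to the tree object and drop the local
    # references, so only one copy stays alive while the tree is in use...
    fKD.pos, fKD.mass = pos, mass
    del pos, mass

    # ... build and query the tree here ...

    # ... then move them back out just before the tree object is torn down.
    pos, mass = fKD.pos, fKD.mass
    del fKD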

Hi Stephen,
Your patch drops my memory usage as well. If you think this works, I say go for it! Thanks, Stephen! -Matt

Geoffrey,

I'm sure you've been following along, and I'd like to ask you to pull again from my bitbucket fork and give things a shot. The new memory-lowering techniques, as well as the total_mass/num_particles stuff I added yesterday, ought to help tremendously with execution times and peak memory. Please let us know if this helps!

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
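A hedged usage sketch of the kind of call this enables, precomputing the total particle mass once and passing it in. The total_mass keyword (and the dataset name) are assumptions based on this message, not a confirmed API, so check the parallelHF signature in the fork before relying on it:

    # Hedged sketch: precompute the total mass so parallelHF does not have to
    # re-derive it. The total_mass keyword is assumed from this thread.
    from yt.mods import *

    pf = load("RD0042/RedshiftOutput0042")   # placeholder dataset name
    dd = pf.h.all_data()
    total_mass = dd.quantities["TotalQuantity"]("ParticleMassMsun")[0]

    halos = parallelHF(pf, threshold=160.0, total_mass=total_mass)
    halos.write_out("HopAnalysis.out")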

OK, I've already given the first set of changed scripts to the admin at Nautilus and haven't heard back from them yet. I'll pull this second set of mods and try it out, and will keep you guys posted if I hear anything.

--back to group meeting--

From G.S.
Participants (3): Geoffrey So, Matthew Turk, Stephen Skory