Re: [Yt-dev] Nautilus: ParallelHaloProfiler I/O issue
Hi Stephen,
I was running parallelHF on Nautilus when I got the following email from the
sysadmins. I'll try to answer their questions below, but let me know if I'm
wrong or if there's something else to be said.
(1) How much data are read/written by the program?
- After all the particles (3200^3 of them) are read in, they are linked with
a Fortran kd-tree if they satisfy certain conditions.
(2) How many parallel readers/writers are used by the program?
- It reads using 512 cores, as set in my submission script. The amount
written to disk depends on the distribution of the particle haloes across
processors; if haloes span multiple processors, more files are
written out by write_particle_lists.
(3) Do you use MPI_IO? Or something else?
- Yes, the program uses mpi4py-1.2.2 installed in my home directory.
The details of the code can be found at:
http://yt-project.org/doc/analysis_modules/running_halofinder.html#halo-find...
under the section "Parallel HOP". A rough sketch of the invocation follows.
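For reference, a minimal sketch of how the script drives parallel HOP, assuming the yt 2.x-era API: the dataset name and threshold value are placeholders, and the other keyword values are copied from the traceback later in this thread.

from yt.mods import load
from yt.analysis_modules.halo_finding.api import parallelHF

# Hypothetical dataset name standing in for the real 3200^3 unigrid output.
pf = load("DD0252/DD0252")

# safety/premerge/rearrange match the call in the traceback below;
# threshold=160.0 is an assumed (stock HOP) value.
halo_list = parallelHF(pf, threshold=160.0, safety=1.5,
                       premerge=True, rearrange=True)

# Each task writes particle lists for the haloes it owns; a halo spanning
# several tasks produces a file on each of them.
halo_list.write_particle_lists("parts")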
Currently I am using 512 cores with 4 GB per core, for a total of 2 TB of RAM,
for this 3200^3 unigrid simulation. Should I decrease the number of
processors while keeping the same amount of RAM? Or are there other ways to
optimize so as not to affect other users?
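For concreteness, here is a back-of-the-envelope check of what that configuration provides per particle (it says nothing about parallelHF's actual per-particle cost, which I haven't measured):

# 512 cores at 4 GB each versus 3200^3 particles.
n_particles = 3200 ** 3                 # ~3.28e10 particles
total_mem = 512 * 4 * 2 ** 30           # 2 TiB in bytes

print(n_particles // 512)               # 64,000,000 particles per core
print(total_mem / float(n_particles))   # ~67 bytes of RAM per particle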
Sorry for the inconvenience.
From
G.S.
On Fri, Sep 23, 2011 at 9:12 AM, Patel, Pragneshkumar B wrote:
Hello,
We have noticed some I/O issues on Nautilus. We suspect that the "ParallelHaloProfiler" program is doing some very I/O-intensive operations periodically (or maybe checkpointing). We would like to throttle these back a bit, so that other users are not affected by it.
I would like to get some more information about your job #60891, e.g.:
(1) How much data are read/written by the program?
(2) How many parallel readers/writers are used by the program?
(3) Do you use MPI_IO? Or something else?
Please give me the details and we will work on your I/O issue.
Thanks,
Pragnesh
Geoffrey,
> (3) Do you use MPI_IO ? or something else ?
> - Yes, the program uses mpi4py-1.2.2 installed in my home directory
The main thing you got wrong is that we do not use MPI_IO. The I/O is done primarily through a custom HDF5 reader written in C, and each thread does its own reading.

The issue that Pragnesh is probably seeing, and what Geoffrey alludes to, is how load balancing is done. Because of the details of how Enzo stores its data, it is difficult to know where to send the data for load balancing without reading it all in first. Out of convenience, once the layout is established, the data is read in again (instead of being distributed via communication), this time by the tasks that have been assigned the data. Furthermore, the data assigned to a task may come from several files, meaning that each task will be opening/closing multiple files multiple times.

If all these I/O calls are causing a problem, I could see about putting in some kind of I/O wait (configurable by the user) that basically slows down the reading part of the process.

p.s. Geoffrey - what is the cosmological size of your box? If it's above about 300 Mpc/h, load balancing is probably not necessary, which should roughly halve the I/O required.

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
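A minimal sketch of the configurable I/O wait Stephen describes above; the function and parameter names here are hypothetical, not yt's actual API:

import time

def read_assigned_files(file_list, read_fn, io_wait=0.0):
    # io_wait is the user-configurable pause (seconds) between file reads,
    # trading wall-clock time for a gentler load on the shared filesystem.
    chunks = []
    for fn in file_list:
        chunks.append(read_fn(fn))  # e.g. the custom C HDF5 reader
        if io_wait > 0.0:
            time.sleep(io_wait)
    return chunks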
Hi Stephen and Pragneshkumar,
Let me know if I should move this discussion off the yt-dev list.
The box size is only 80 Mpc, or 56 Mpc/h, so load balancing probably helps
with peak memory more than it hinders, but I could still try turning it off.
Would decreasing the number of cores alleviate some of the symptoms? Or
would that just increase the computation time too much? I haven't done any
scaling tests on parallelHF.
Incidentally, I've killed the job after seeing the following error. It looks
like MPI was having some buffer trouble; could that be caused by the heavy
I/O?
.
.
.
MPI WARNING: Could not allocate an internal buffer in the last 30 seconds
on rank 503. Try increasing MPI_BUFS_PER_PROC and/or MPI_BUFS_PER_HOST.
MPI WARNING: Could not allocate an internal buffer in the last 30 seconds
on rank 505. Try increasing MPI_BUFS_PER_PROC and/or MPI_BUFS_PER_HOST.
MPI WARNING: Could not allocate an internal buffer in the last 30 seconds
on rank 509. Try increasing MPI_BUFS_PER_PROC and/or MPI_BUFS_PER_HOST.
MPI WARNING: Could not allocate an internal buffer in the last 30 seconds
on rank 511. Try increasing MPI_BUFS_PER_PROC and/or MPI_BUFS_PER_HOST.
Traceback (most recent call last):
  File "ParallelHaloProfiler.py", line 17, in <module>
    rearrange=True, safety=1.5, premerge=True)
  File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/analysis_modules/halo_finding/halo_objects.py", line 1861, in __init__
    root_points = self._subsample_points()
  File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/analysis_modules/halo_finding/halo_objects.py", line 1991, in _subsample_points
    root_points = self._mpi_concatenate_array_on_root_double(my_points[0])
  File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/utilities/parallel_tools/parallel_analysis_interface.py", line 187, in passage
    return func(self, data)
  File "/nics/b/home/gsiisg/NautilusYT/src/yt-hg/yt/utilities/parallel_tools/parallel_analysis_interface.py", line 794, in _mpi_concatenate_array_on_root_double
    data = na.concatenate((data, new_data))
ValueError: negative dimensions are not allowed
Can I get a refund of the SUs used by that job?
From
G.S.
Hi Geoffrey,

Have you tried increasing those two environment variables in your batch script? I've seen the same warnings and failures on Pleiades with Enzo, and I've been able to avoid them by increasing them to

export MPI_BUFS_PER_PROC=64
export MPI_BUFS_PER_HOST=256

John
--
John Wise
Assistant Professor of Physics
Center for Relativistic Astrophysics, Georgia Tech