Question on Using Parallel Functionality
Hi everyone,
I'm working on an analysis of the ~500 outputs I have from a simulation run, so naturally I want to do it in parallel. The data live on Pleiades, where a single node has 32 or 64 GB depending on the machine you pick.
The general code structure is to take a dataset and compute the fluxes of multiple quantities binned in radius. Because the outputs are large, I'd like to load one dataset per node but then use all 16 cores on the node for the radial flux calculations.
To test my code, I'm using 2 smaller outputs of ~4.5 GB each, so they should easily fit on one node, but I keep getting memory errors from Pleiades. The code does run correctly on my laptop. I'm fairly certain I'm not setting up the code correctly with the different num_proc keywords, so that it's trying to do the calculation on a single core instead of half the node.
I've posted a pared-down example of my code to pastebin that uses two outputs from the enzo_cosmology_plus dataset. The code is named "flux_test_parallel.py" and should run if put inside that dataset directory. The parallel portion of the code is preceded by a line of #'s. ( http://paste.yt-project.org/show/21/ )
Any advice on how to get the parallel structure to use the machine's memory correctly, or general pointers for this kind of script, would be really appreciated!
Thanks!
Lauren
Hi Lauren,
How are you running this test script? With 16 MPI tasks or just 2?
-Nathan
Hey Nathan,
On Pleiades, I've run it with mpiexec -np 16
On my laptop, I've been trying mpirun -np 4
--Lauren
Ah OK, so in your code you're using the following parallel_objects construction:
yt.parallel_objects(outs,2,storage=storage)
You're right that, as you say in the comment, this splits the work up into two groups of workers. I suspect what's happening is that, on Pleiades for example, each of the 8 MPI processes per group is doing a substantial amount of I/O, creating arrays, etc., in a manner that is not parallelized. That means you've effectively octupled the per-node memory requirements for running your script.
Instead, it might be better to, e.g., do the I/O and set up the arrays for the radial flux measurement only on one MPI process in each group of processes, then *broadcast* the data to the other processes, and do the flux calculation in parallel. Be sure that the data you're broadcasting can fit in memory on the node, though, since again all 8 processes are going to have local copies of the data. It might also be possible to write a Cython program that uses OpenMP to process the data in parallel using hardware threads and avoid dealing with MPI at all; in that case I'd only launch 2 MPI processes, if you want to process the flux calculations on 8 threads.
Also, I'm not sure offhand whether the nested parallel_objects calls you're using will work without explicitly setting up two sets of MPI communicators. It looks like multilevel parallelism isn't documented, so I'd probably need to dive into the parallel analysis framework in yt to see how it handles what you're trying to do to be sure. We should probably ultimately add an example of the sort of multilevel parallelism you're trying to do to the docs...
-Nathan
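[Editor's note: to make the broadcast idea a bit more concrete, here is a minimal mpi4py sketch of that pattern for the flat (non-nested) case. The file name and the per-rank work are placeholders rather than anything from the actual script, and in the nested setup you would broadcast over the group's sub-communicator instead of COMM_WORLD.]

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD              # with nesting, swap in the group's sub-communicator
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # only one process does the I/O and builds the radial arrays
    data = np.load("radial_arrays.npy")   # placeholder for the yt I/O / array setup
else:
    data = None

# every rank now gets its own in-memory copy, so each copy must fit per-process
data = comm.bcast(data, root=0)

# each rank then works on its slice of the flux calculation
my_chunk = data[rank::size]

Note that the lowercase bcast pickles the object; for very large numpy arrays the uppercase Bcast into a preallocated buffer is usually more memory-friendly.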
Thanks for the suggestion, Nathan! I'll try broadcasting the data for the radial fluxes. As for the nested parallel_objects calls, whenever you get the chance to add an example, that would be great. I've gotten this far using the docs and they've been really helpful in general!
But for now, maybe I'll remove the nested structure, since it's not working the way we think it should, and just submit multiple jobs with the flux measurements done in parallel.
--Lauren
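[Editor's note: for reference, a minimal sketch of that single-level fallback, assuming one output per submitted job; the output name, the radii, and the total_quantity call are placeholders standing in for the real flux measurement.]

import yt
yt.enable_parallelism()

ds = yt.load("DD0010/DD0010")                      # placeholder: the one output handled by this job
radii = ds.arr([10.0, 20.0, 40.0, 80.0], "kpc")    # placeholder radial bins

storage = {}
for sto, radius in yt.parallel_objects(radii, storage=storage):
    sp = ds.sphere(ds.domain_center, radius)
    # stand-in for the actual flux calculation at this radius
    sto.result_id = float(radius)
    sto.result = sp.quantities.total_quantity(("gas", "cell_mass"))

if yt.is_root():
    for r, val in sorted(storage.items()):
        print(r, val)

Run with, e.g., mpirun -np 4 python flux_test_single_level.py (a hypothetical name); the storage dictionary is gathered across tasks once the loop finishes.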
Hi Lauren and Nathan,
I have a couple of scripts with nested parallel_objects calls that I can share:
http://paste.yt-project.org/show/22/
http://paste.yt-project.org/show/23/
I wanted to calculate the column density in multiple directions from each star particle to the halo's surface, in order to calculate the ionizing radiation escape fraction. I first (dynamically) parallelized over halos and then over rays (192 of them).
Paste #23 is my testing code to get situated with nested parallelism in yt, doing what I described, and it should be clearer than my production code (#22).
But when I started the actual analysis code (#22), I needed to write the results to file. I couldn't let just the root processor write to disk because of the multi-level parallelism; instead I needed the root processor of each sub-communicator (see the is_group_root() call) to write to disk. I ended up writing one file per sub-communicator and then letting the root processor combine all of the files at the end. I could have used parallel HDF5 or something else more clever, but this did the job.
Also, unrelated to these scripts but related to parallel loops over many large datasets (>100k AMR grids): I've found it necessary to manually clean up the hierarchy metadata after finishing an iteration with the following commands to keep memory usage in check.
ds.index.clear_all_data()
del ds.index.grid_dimensions
del ds.index.grid_left_edge
del ds.index.grid_right_edge
del ds.index.grid_levels
del ds.index.grid_particle_count
del ds.index.grids
I hope this helps!
Thanks,
John
--
John Wise
Associate Professor of Physics
Center for Relativistic Astrophysics, Georgia Tech
http://cosmo.gatech.edu
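[Editor's note: as a rough illustration of the write-then-combine step John describes (the file naming here is hypothetical, not taken from his pastes): each sub-communicator's group root writes its own partial file during the loop, and at the end the global root stitches them together.]

import glob
import yt

# ... nested analysis runs here; each group root writes "fluxes_group_NN.txt" ...

if yt.is_root():
    with open("fluxes_all.txt", "w") as combined:
        for part in sorted(glob.glob("fluxes_group_*.txt")):
            with open(part) as f:
                combined.write(f.read())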
I've created an issue to track adding a multilevel parallelism example to the docs:
https://github.com/yt-project/yt/issues/1830
participants (3)
- John Wise
- Lauren Corlies
- Nathan Goldbaum