Projection speed improvement patch
Hi guys,

(For all of these performance indicators, I've used the 512^3 L7 amr-everywhere run called the "LightCone." This particular dataset has ~380,000 grids and is a great place to find the )

Last weekend I did a little bit of benchmarking and saw that the parallel projections (and likely several other parallel operations) all sat inside an MPI_Barrier for far too long. I converted (I think!) this process to an MPI_Alltoallv operation, following an MPI_Allreduce to get the final array size and the offsets into an ordered array, and I think it is working. I saw pretty good performance improvements, but it's tough to quantify them right now -- for projecting "Ones" (no disk access) it sped things up by ~15%.

I've also added a new binary hierarchy method to devel enzo, and it provides everything that is necessary for yt to analyze the data. As such, if a %(basename)s.harrays file exists, it will be used, and yt will not need to open the .hierarchy file at all. This sped things up by 100 seconds. I've written a script to create these (http://www.slac.stanford.edu/~mturk/create_harrays.py), but outputting them inline in Enzo is the fastest.

To top this all off, I ran a projection -- start to finish, including all overhead -- on 16 processors. Projecting the fields "Density" (native), "Temperature" (native) and "VelocityMagnitude" (derived, requires x-, y- and z-velocity) to the finest resolution (adaptive projection -- to L7) takes 140 seconds, or roughly 2:20.

I've looked at the profiling outputs, and it seems to me that there are still some places performance could be squeezed out. That being said, I'm pretty pleased with these results.

These are all in the named branch hierarchy-opt in mercurial. They rely on some rearrangement of the hierarchy parsing that has lived in hg for a little while; it will go into the trunk as soon as I get the all clear about moving to a proper stable/less-stable dev environment. I also have some other test suites to run on them, and I want to make sure the memory usage is not excessive.

Best,

Matt
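To make the communication pattern above concrete, here is a minimal mpi4py sketch of the Allreduce-then-Alltoallv idea -- this is not the actual yt patch, and the local arrays, sizes, and dtypes are invented for illustration:

    # A minimal mpi4py sketch of the pattern described above: an Allreduce
    # so every task learns every task's contribution size (and hence the
    # offsets into one ordered array), then an Alltoallv so every task ends
    # up with the full array.  NOT the actual yt patch; local array
    # contents and sizes are made up.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    local = np.arange(10.0 * (comm.rank + 1))   # this task's piece (made up)

    # Allreduce a size vector: each task fills in its own slot, and the SUM
    # reduction gives everyone the complete list of sizes.
    sizes = np.zeros(comm.size, dtype='i')
    sizes[comm.rank] = local.size
    comm.Allreduce(MPI.IN_PLACE, sizes, op=MPI.SUM)
    offsets = np.zeros(comm.size, dtype='i')
    offsets[1:] = np.cumsum(sizes)[:-1]

    # Alltoallv: each task sends its whole piece to every task (send
    # displacements all zero, so the same send buffer is read repeatedly,
    # which is legal) and receives each piece at its precomputed offset.
    sendcounts = np.full(comm.size, local.size, dtype='i')
    senddispls = np.zeros(comm.size, dtype='i')
    combined = np.empty(sizes.sum(), dtype='d')
    comm.Alltoallv([local, (sendcounts, senddispls), MPI.DOUBLE],
                   [combined, (sizes, offsets), MPI.DOUBLE])

The point of the Allreduce is that every task knows every other task's contribution size up front, so each piece lands at a fixed offset in one ordered array without any barrier.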
Hi Matt,
This is awesome. I don't think anyone can expect much faster for that
dataset. I remember running projections just a year or so ago on this data
and it taking a whole lot more time (just reading in the data took ages).
What machine were you able to do this on? I'm mostly curious about the
memory it used, or had available to it.
In any case, I'd say this is a pretty big success, and the binary
hierarchies are a great idea.
Cheers,
Sam
Hi Sam,

I guess you're right, without the info about the machine, this doesn't help much!

This was running on a new machine at SLAC called 'orange-bigmem' -- it's a 32-node machine with a ton of memory available to all the processors. I checked memory usage at the end of the run, and after the projection had been saved out a few times it was around 1.5 gigs per node. I'm threading some outputs of the total memory usage through the projection code, and hopefully that will give us an idea of the peak memory usage.

The file system is Lustre, which works well with the preloading of the data, and I ran it a couple of times beforehand to make sure that the files were in local cache.

So the communication was via shared memory, which, while still an MPI interface, is much closer to ideal. I will be giving it a go on a cluster tomorrow, after I work out some kinks with data storage.

I've moved the generation of the binary hierarchies into yt -- so if you don't have one, rather than dumping the hierarchy into the .yt file, it will dump it into the .harrays file. This way if anyone else writes an interface for the binary hierarchy method, we can all share it. (I think it would be a bad idea to have Enzo output a .yt file. ;-) The .yt file will now exist solely to store objects, not any of the hierarchy info.

-Matt
Hi guys,
I just wanted to report one more benchmark which might be interesting
for a couple of you. I ran the same test, with *one additional
projected field* on Triton, and it takes 3:45 to project the entire
512^3 L7 amr-everywhere dataset to the finest resolution, projecting
Density, Temperature, VelocityMagnitude (requires 3 fields) and
Gravitational_Potential. This is over ethernet, rather than shared
memory (I did not use the myrinet interconnect for this test) and it's
with an additional field -- so pretty good, I think.
There are some issues with processors lagging, but I think they are
not a big deal anymore!
-Matt
Hi Matt,

That is great news! About two months ago, I tried doing full projections on a 768^3 AMR-everywhere run (10 levels) on Ranger, but I had problems running out of memory (I never really checked memory usage, because you can't have interactive jobs there, to my knowledge). I was running with 256 cores (512 GB RAM should be enough...). The I/O was taking forever, too, so I ended up just doing projections of subvolumes.

But I'll be sure to test your improved version (along with the .harrays file) and report back to the list!

Thanks!
John
Wow, John -- that's simply unacceptable performance on yt's part, both
from a memory and a timing standpoint.
I'd love to take a look at this data, so if you want to toss it my
way, please do so!
Hi John,
I ended up tracking down a bug in the newly refactored hierarchy (note
about that below) but I benchmarked projections of the code on Triton.
Unfortunately, the network disk was kind of dying, so the benchmarks
are more IO dominated than I think they should be -- I might give it a
go in a few days, if I notice the disk performing any better.
It takes 250 seconds to do the IO and roughly 400 seconds total on 32
processors. This is projecting all the way to L10, three fields (one
of which is a derived field, composite from three, so a net of six
fields are being read.) This means about 150 seconds for the
instantiation (which is now negligible) and the math. Some processors
even sat in the Alltoallv call for ~300 seconds (the upper left patch,
for instance, which only goes to L6) so I believe I can now assert
it's completely IO dominated. (Note on that below.)
Looking at the memory usage after each level was done, the
get_memory_usage() function -- which opens /proc/pid/shmem -- reports
that the maximum memory usage per task is 1.6 Gigs. The final L10
projection of the three fields + weight + position information takes
572Mb of space. Creating a 4098^2 (one pixel on either side for the
border!) image takes about 5 seconds. With the plotting refactor, I
anticipate this coming down, because right now it calls the
Pixelization routine too many times. (The pixelization routine takes
*less* time than the write-out-png routine!)
Keep in mind that the yt image making process, once you have
projected, is essentially free -- so once you do your projection (400
seconds on 32 processors with suboptimal disk) you can make infinite
images for free on, for instance, your laptop.
The two things I'm still not completely set on:

 * The main bug I tracked down was that the grids were all off with the
   new hierarchy, but not the old. If I changed from calculating dx for
   all grids as (RE - LE)/dims to setting dx from Parent.dds /
   refine_factor, it worked just fine. But looking at the values, I
   don't see why this changed anything, unless it also messed up the
   integer coordinates or the inclusion of grids in regions. (See the
   toy example after this list.) Anyway, that now reproduces all the
   old behavior.
 * There are too many calls to the IO routines; it only reads from
   each file once, but for some reason it's calling the
   "ReadMultipleGrids" routine more than the number of CPU files, which
   means something is wrong. I'll possibly dig into this at a later
   date.
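As a toy illustration of the first bullet (nothing below is yt code, and the numbers are invented): a grid whose edges were written to the hierarchy with limited precision yields a slightly different dx from its own edges than from its parent's dds:

    # Toy example (not yt code): a child grid on a refine-by-2 level whose
    # right edge was stored with truncated precision.  Invented numbers.
    refine_factor = 2
    parent_dds = 1.0 / 512.0              # parent cell width: 0.001953125
    dims = 53                             # child grid cells along one axis
    LE = 0.31250                          # left edge as parsed from disk
    # right edge as it might be stored, truncated to six decimal places:
    RE = round(LE + dims * parent_dds / refine_factor, 6)

    dx_from_edges = (RE - LE) / dims              # old: from the grid's edges
    dx_from_parent = parent_dds / refine_factor   # new: Parent.dds / refine_factor

    print(dx_from_edges)    # 0.0009765660... -- carries the truncation error
    print(dx_from_parent)   # 0.0009765625   -- exact by construction

Small per-grid errors like this can shift fine-level grids relative to the integer coordinates, which would explain grids landing in the wrong regions.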
Anyway, I'm pretty happy with this. :)
-Matt
Hi Matt,

Thanks so much for taking a look at my data and examining the memory usage of analyzing a dataset of this size. I'll have to give it another shot on Ranger. I can also see how I/O performance is on the Altix here at Princeton, which has a local RAID (just like red).

You said that I could do projections on my laptop once the computation is done on a large machine. I know the projection structure is stored in the .yt file, but are the projected fields also stored in the .yt file? Or do I have to have the data on my laptop?

Thanks again!
John
Hi John,

> Thanks so much for taking a look at my data and examining the memory
> usage of analyzing a dataset of this size. I'll have to give it another
> shot on Ranger. I can also see how I/O performance is on the Altix here
> at Princeton, which has a local RAID (just like red).

Awesome. This is all with the hierarchy-opt branch in mercurial, but I think I am going to port it back to trunk in the very near future, now that I have tested it on a number of different datasets.

> You said that I could do projections on my laptop once the computation
> is done on a large machine. I know the projection structure is stored
> in the .yt file, but are the projected fields also stored in the .yt
> file? Or do I have to have the data on my laptop?

All the fields are stored as well. Here's the .yt file for the projections I made of your data:

    [mjturk@login-4-0 RS0064]$ h5ls -r restart0064.yt
    /DataFields                              Dataset {36}
    /Projections                             Group
    /Projections/0                           Group
    /Projections/0/Density_Density           Dataset {9355404}
    /Projections/0/Temperature_Density       Dataset {9355404}
    /Projections/0/VelocityMagnitude_Density Dataset {9355404}
    /Projections/0/pdx                       Dataset {9355404}
    /Projections/0/pdy                       Dataset {9355404}
    /Projections/0/px                        Dataset {9355404}
    /Projections/0/py                        Dataset {9355404}
    /Projections/0/weight_field_Density      Dataset {9355404}

So it's stored as DataType/Axis/Field. Here we are also storing the weight field, so that if you add a new field, the weight doesn't need to be projected again.

To make a portable dataset of projections, you'll need the parameter file, either the .hierarchy or .harrays file (in the repo I have tried very hard to keep it so that the instantiation only touches those two files), and the .yt file, and you can do something like:

    pf = EnzoStaticOutput("restart0064", data_style="enzo_packed_3d")
    pc = PlotCollection(pf, center=[0.5, 0.5, 0.5])
    pc.add_projection("Density", 0, "Density")

This is slightly wordier, in that you have to specify the data_style, but I think that can be addressed as well.

HOWEVER, it occurred to me this morning that the FixedResolutionBuffer should be able to accept the .yt file by itself, without the parameter file or anything like that, and that maybe that should be used for portable projections. That would reduce the memory overhead, so I'm going to take a quick look into that this morning.

-Matt
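Pulling a stored projection back out of that file is plain HDF5 reading. A sketch with h5py -- the dataset names are copied from the h5ls listing above, but treating px/py as cell positions and pdx/pdy as cell half-widths is an assumption of the sketch, not something the listing guarantees:

    # Sketch: read the stored projection arrays directly with h5py.
    # Dataset names come from the h5ls listing above; the px/py/pdx/pdy
    # interpretation is assumed for illustration.
    import h5py

    with h5py.File("restart0064.yt", "r") as f:
        proj = f["/Projections/0"]
        px, py = proj["px"][...], proj["py"][...]
        pdx, pdy = proj["pdx"][...], proj["pdy"][...]
        density = proj["Density_Density"][...]

    # With positions, widths, and field values in hand, an image is just a
    # pixelization of these variable-size cells -- no simulation data or
    # hierarchy required.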
The parallel projections work well on Ranger now. I ran on 32 cores, just like you did on Triton. I projected density, temperature, and electron fraction, all weighted by density, for the same dataset I gave you.

yt was using slightly more memory on Ranger at 2.1 GB/core, which isn't bad at all. This pushed me over the 2 GB/core limit on Ranger, so I had to use 8 cores/node instead of 16.

However, it was slower by a factor of 2.5. It took 1084 seconds from start to finish (including all of the overhead). I had already created a binary hierarchy beforehand. Ranger in general is slow (I suspect its interconnect), so maybe it's just a "feature" of Ranger.

Somewhat related -- the Alltoallv call was failing when I compiled mpi4py with openmpi, but this went away when I compiled it with mvapich. If you want to see where it failed, I put the traceback at http://paste.enzotools.org/show/253/

Once I get mpi4py working on the Altix here, I'll post some timings from that, as well.

Cheers,
John
Hi John,

> yt was using slightly more memory on Ranger at 2.1 GB/core, which isn't
> bad at all. This pushed me over the 2 GB/core limit on Ranger, so I had
> to use 8 cores/node instead of 16.

Oh, hm, interesting...

> However, it was slower by a factor of 2.5. It took 1084 seconds from
> start to finish (including all of the overhead). I had already created
> a binary hierarchy beforehand.

Okay. So that is something I wonder if we can improve -- particularly since you're already seeing that you need more cores to run anyway. Right now, the mechanism for reading data goes something like this:

    Projection:
        for each level:
            identify grids on this level
            read all grids:
                for file in all_files_for_these_grids:
                    H5Fopen(file)
                    for each grid in this file:
                        H5Dread(each dataset for this grid)

So for each file that appears on a given level, the corresponding CPU file is only H5Fopen'd once -- which, with large Lustre systems, should help out. (However, it does do multiple, potentially very small, H5Dreads -- but I think we might be able to coalesce these the same way enzo (optionally) can, with the H5P_DATASET_XFER property type, since we're reading into void*'s that should exist through the entirety of the C function.)

However, one other option would be to allow the projections to preload the entire dataset, rather than just the files needed for that level. If we assume complete grid locality, then our level-by-level approach *could* have roughly (N_enzo_cpus)/(N_yt_cpus) * N_levels H5Fopens, but it could be a lot worse with the standard enzo load balancing. The projections parallelize by 2D domain decomposition, and they define their regions right away. So if we were to preload the entire dataset, rather than level-by-level, we'd use more memory but we'd have fewer H5Fopen calls (which, again, I'm told are the most expensive part of Lustre data access).

I've created a patch that handles both of these things, and it's in the hierarchy-opt branch as hash 801b378a22f7. Because this touches the C code, it requires a new install or develop (depending on how you installed the first time). If you get a chance, could you let me know if this improves things? I think modifying the exact buffer size inside yt/lagos/HDF5LightReader.c may adjust things as well. (I was unable to test whether this worked any better, as Triton was down...)

You can change the preloading mechanism by setting the argument preload_style to either "level" (currently the default) or "all" (where it loads the entire source that it "owns"). This can be passed through the call to add_projection:

    pc.add_projection("Density", 0, preload_style='all')
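For clarity, the grouping that gives "one H5Fopen per CPU file" looks something like the following sketch -- written in Python rather than the C of HDF5LightReader, and with the grid objects' .filename/.id attributes and the "/Grid%08i/<field>" dataset layout assumed for illustration:

    # Sketch of the per-file read strategy described above (not the actual
    # HDF5LightReader code).  Grid attributes and dataset paths are
    # assumptions.
    import h5py
    from collections import defaultdict

    def preload(grids, fields):
        by_file = defaultdict(list)
        for g in grids:                      # group grids by their CPU file
            by_file[g.filename].append(g)
        data = {}
        for fn, gs in by_file.items():
            with h5py.File(fn, "r") as f:    # open each file exactly once
                for g in gs:
                    for field in fields:
                        data[g.id, field] = \
                            f["/Grid%08i/%s" % (g.id, field)][...]
        return data

With preload_style="level" the grids argument would be just one level's grids; with "all" it would be every grid the task owns, trading memory for fewer opens.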
> Somewhat related -- the Alltoallv call was failing when I compiled
> mpi4py with openmpi, but this went away when I compiled it with
> mvapich.

Dang it. This looks like I'm just passing around arrays that are too big. I think for this I might need some help from other people about the right way to do this... Ideas, anybody?

-Matt
> Dang it. This looks like I'm just passing around arrays that are too
> big. I think for this I might need some help from other people about
> the right way to do this... Ideas, anybody?

Hm, at the risk of replying to my own message: we might be able to do this by using a memmap onto the disk. But then we'd be assuming homogeneous clusters, which, honestly, I think we should just go ahead and assume. I'd rather keep things off the disk if at all possible, but when we're dealing with arrays of this size it does become a bit prohibitive...

-Matt
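In case it helps the discussion, roughly what the memmap idea could look like -- a sketch only, never implemented: it assumes a shared filesystem, a scratch file pre-sized by one task, and a barrier between the write and read phases:

    # Sketch of the memmap idea: each task drops its piece into a shared,
    # pre-sized scratch file at its precomputed offset, then maps the
    # whole file read-only.  Assumes a shared filesystem and an MPI
    # barrier between phases; not implemented anywhere.
    import numpy as np

    def exchange_via_memmap(path, local, offset, total_size):
        out = np.memmap(path, dtype=local.dtype, mode="r+",
                        shape=(total_size,))
        out[offset:offset + local.size] = local   # write this task's slab
        out.flush()
        del out
        # ...MPI barrier here, so all slabs are on disk...
        return np.memmap(path, dtype=local.dtype, mode="r",
                         shape=(total_size,))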
Matt,

> Hm, at the risk of replying to my own message: we might be able to do
> this by using a memmap onto the disk.

Pardon my ignorance, but how is the Alltoallv used? In parallel HOP, I communicate arrays that are large fractions of a GB in size with no problem using un-pickled methods. I looked, and it appears that Alltoallv is a non-pickling comm -- is this correct?

Stephen Skory
sskory@physics.ucsd.edu
http://physics.ucsd.edu/~sskory/
> Pardon my ignorance, but how is the Alltoallv used? In parallel HOP, I
> communicate arrays that are large fractions of a GB in size with no
> problem using un-pickled methods. I looked, and it appears that
> Alltoallv is a non-pickling comm -- is this correct?

Hi Stephen,

No, you're totally right -- it's used to communicate a single array at a time via direct buffers. So if you're projecting Density and Temperature, the alltoall would be something like:

    calculate_sizes
    calculate_offsets
    for field in ["px", "py", "pdx", "pdy", "Density", "Temperature"]:
        create_full_array
        alltoall(sizes, offsets, field, full_array)

I'm not sure what the deal is with OpenMPI on Ranger. I'll see if I can replicate this and dig a bit deeper on Ranger tomorrow...

-Matt
> I've created a patch that handles both of these things, and it's in the
> hierarchy-opt branch as hash 801b378a22f7.

Triton came back up, and I've been testing this; for some reason preload_style='all' is not working over here. I'm going to take a look at it. I also discovered that ACC_RDONLY only works with the HDF_CORE driver for HDF5 1.8, so I'll see about inserting a check against the API version for that in the code.

-Matt
On 8 Nov 2009, at 19:06, Matthew Turk wrote:

> I'm not sure what the deal is with OpenMPI on Ranger. I'll see if I can
> replicate this and dig a bit deeper on Ranger tomorrow...

I tested some more on Ranger, and I think it's Ranger's fault. Parallel projections work with mvapich+icc and openmpi+icc, but fail with openmpi+gcc. When I tried to start a batch job on Ranger with "ibrun tacc_affinity" while using openmpi+gcc, tacc_affinity fails, so python is never called. I can get around this if I don't call tacc_affinity to control the processor/core placement, i.e.

    ibrun mpi4py my_script.py --parallel

This is when I get the errors I posted. I'm not sure whether this is worth debugging on yt's side, because I think Ranger is at fault here.

Sorry about the false alarm! But I guess it's good to note this.

John
> This is when I get the errors I posted. I'm not sure whether this is
> worth debugging on yt's side, because I think Ranger is at fault here.
>
> Sorry about the false alarm! But I guess it's good to note this.

Oh, hm, interesting -- thanks for the resolution!

I played around a bit, and in my tests the H5Pcreate call actually added a substantial amount of time to the running time, and preloading the entire hierarchy pushed the memory usage up by quite a bit.

One thing that I'm finding disturbing is that even though I am preloading, somehow ReadData (which reads a single set at a time) gets called for the weight field. This shouldn't be happening, so I'm going to try to track it down in smaller runs... The ReadMultipleDataSets routine performs pretty well, so if I can get rid of the ReadData calls, the overall improvement should be pretty good.

(For reference, on the mirage/lustre filesystem on Triton, reading a single field one-at-a-time from the grids on processor 1 for the light cone took 40 seconds. Reading six fields, only calling H5Fopen once per CPU file, took 30 seconds. Yowza. I think those 40 seconds are unnecessary and can and should be eliminated.)

-Matt
participants (4)

- John Wise
- Matthew Turk
- Sam Skillman
- Stephen Skory