Hi everyone,

I am trying to improve the efficiency of my analysis script, which calculates attributes of haloes. After watching the workshop video on yt parallelism, I was motivated to give parallel_objects a try. I am basically trying to calculate, then output, some properties of each halo found by parallel HOP. It turns out that even if I just output the (DM particle) mass of each halo, I am missing haloes. It doesn't matter if I run in serial or parallel; I end up missing the same number of haloes if I use parallel_objects() like this:

haloes = LoadHaloes(pf, HaloListname)
for sto, halo in parallel_objects(haloes, num_procs, storage = my_storage):

to iterate over the haloes, and the problem goes away if I just switch to:

for halo in haloes:

I noticed this when I tried it on an 800^3 dataset with around 50k haloes: I only get 4k haloes back. I then tried to narrow things down, and I ruled out the way I am calculating the attributes, because even if I just output the mass from halo.total_mass(), which is basically read in from the .h5 file, I still end up missing haloes when using parallel_objects. For a 128^3 dataset with 85 haloes, I end up missing 3 and get 82 back, and for a 64^3 dataset with 22 haloes, I get back 21 haloes.

Has anyone else encountered this behavior, or can anyone confirm it?

From G.S.
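Stripped down, the pattern in question looks like the following (a minimal sketch only: the imports, dataset path, halo-list name, and num_procs value are placeholders, and the loop body is the bare-essentials version quoted later in this thread):

# Minimal sketch of the failing pattern; paths and names are placeholders.
from yt.mods import *   # yt 2.x style star import, assumed here to provide
                        # load, LoadHaloes, parallel_objects, and na (numpy)

pf = load("DD0273/DD0273")                         # hypothetical dataset path
haloes = LoadHaloes(pf, "DD0273_z5.00_halo_list")  # hypothetical halo-list name
num_procs = 4                                      # hypothetical task count

my_storage = {}
for sto, halo in parallel_objects(haloes, num_procs, storage = my_storage):
    sto.result_id = halo.total_mass()   # bare-essentials loop body:
    sto.result = na.array([0])          # store the mass and a placeholder "0"

# After the loop, my_storage should hold one entry per halo -- but here
# some haloes go missing, while a plain "for halo in haloes:" loses none.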
Hi Geoffrey,
Stephen might be able to shed some light on this, but I think LoadHaloes will pre-assign processors to the halo objects, whereas parallel_objects will operate independently of that, distributing haloes first come, first served. But I'd really like to see this functionality work. Stephen, do you think there's any way around this? Could we modify parallel_objects to be aware of the processor assigned to an object?

-Matt
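To make the distinction concrete, here is a toy sketch of the two distribution strategies being contrasted (purely illustrative; this is not how either LoadHaloes or parallel_objects is actually implemented):

# Purely illustrative -- not yt code.
import collections

haloes = list(range(10))   # stand-ins for halo objects
num_tasks = 3

# Pre-assignment: each task's share is fixed up front, e.g. as a
# round-robin slice by index (what LoadHaloes is suspected of doing).
for rank in range(num_tasks):
    print("task %d pre-assigned haloes %s" % (rank, haloes[rank::num_tasks]))

# First come, first served: haloes sit in a shared queue and whichever
# task is idle takes the next one, so the task -> halo mapping is only
# decided at runtime (what parallel_objects is described as doing).
queue = collections.deque(haloes)
while queue:
    halo = queue.popleft()   # handed to whichever task asks first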
Hi Geoffrey,
> haloes = LoadHaloes(pf, HaloListname)
> for sto, halo in parallel_objects(haloes, num_procs, storage = my_storage):
Can you paste the whole script? Thanks.
> Stephen might be able to shed some light on this, but I think LoadHaloes will pre-assign processors to the halo objects, whereas parallel_objects will operate independently of that, distributing haloes first come, first served.
In fact, LoadHaloes should not work that way. Each task should have a full copy of the halo data, but initially only the data in the HopAnalysis.out file; the particles are loaded on demand. I've been using something like what Geoffrey's trying to do for a while with no issue. I'm hoping maybe there's something in Geoffrey's script... but I've been wrong before.

-- 
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
Hi,

Originally I was outputting something like 17 or 18 columns of attributes of the ellipsoids associated with the haloes, but I've ripped them out to the bare essentials of just outputting the mass of the halo and a "0" for the attribute, to narrow down the problem. So in this script I do nothing with the ellipsoids:

http://paste.yt-project.org/show/2250/

For each halo, the output should have the halo DM particle mass in the first column and a zero in the second column.

The output when using "for halo in haloes:":
http://paste.yt-project.org/show/2251/

The output when using "for sto, halo in parallel_objects(haloes, num_procs, storage = my_storage):":
http://paste.yt-project.org/show/2252/

The original halo list from parallel HOP:
http://paste.yt-project.org/show/2253/

So the number of haloes agrees with the "for halo in haloes" method and disagrees with the parallel_objects() method.

From G.S.

in the DD0273_z5.00_halo_list.out file I have 24 lines, first two
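For reference, the step that turns my_storage into that two-column file might look like the following (a sketch only; the filename, the format string, and the assumption that the combined storage dict is available after the loop are mine, not taken from the pasted script):

# Hypothetical sketch of writing the two-column output from my_storage,
# where each entry maps result_id (here, the halo mass) to na.array([0]).
f = open("halo_attributes.out", "w")        # hypothetical filename
for mass in sorted(my_storage):
    attr = my_storage[mass]
    f.write("%e %d\n" % (mass, attr[0]))    # column 1: mass, column 2: zero
f.close()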
I've just thought of something: I did a pull from dev-yt into my fork of yt sometime last week (to get the parallel_objects functionality alongside my ellipsoid stuff), so maybe I broke something. I'll re-run the thing with stock dev-yt and see if the error is still there. I recall I had to merge halo_objects.py manually, so maybe I screwed something up there. I wasn't suspecting the merge to be problematic, because parallel HOP ran just fine.

From G.S.
I just did a fresh install of dev-yt 2.4 with the latest install script and confirmed that the behavior is still the same.

changeset:   5387:7ff85b5c7dcc
branch:      yt
tag:         tip
parent:      5386:e08c15b9ef01
parent:      5385:a5af0cffb818
user:        Matthew Turk <matthewturk@gmail.com>
date:        Thu Mar 22 16:51:23 2012 -0400
summary:     Merging

From G.S.
It seems there was a problem with the mailing list, but I hope this email gets through. I can confirm Stephen's simple solution: instead of using halo.total_mass() (a float) as the result_id, use the id of the halo via the line

sto.result_id = halo.id

and it works. I now get all my haloes again!

From G.S.
Hi Geoffrey,

It looks like the parallel_objects storage doesn't like it when result_id is a float. So you have this:

for sto, halo in parallel_objects(haloes, num_procs, storage = my_storage):
    sto.result_id = halo.total_mass()
    sto.result = na.array([0])

and when I try this on my own data, I do see a deficit of items in my_storage. But when I do this instead:

for sto, halo in parallel_objects(haloes, 0, storage = my_storage):
    sto.result_id = halo.id
    sto.result = (halo.total_mass(), na.array([0]))

I get the correct number of items in my_storage. I hope this helps!

-- 
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
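One plausible reading of why a float result_id misbehaves here (a guess, not something verified against yt's source in this thread): my_storage is a plain dict keyed by result_id, and dict keys must be unique, so any two haloes whose total masses are exactly equal (which happens easily, since a halo's DM mass is an integer multiple of the particle mass) silently overwrite each other, while halo.id is unique by construction. That would also match the deficits reported above: 22 haloes coming back as 21 means exactly one pair of equal-mass haloes. A toy demonstration:

# Toy demonstration (not yt code): why a float mass makes a bad key.
particle_mass = 1.0e8                  # hypothetical DM particle mass, in Msun
particle_counts = [40, 52, 40, 97]     # two haloes share a particle count

by_mass = {}
by_id = {}
for halo_id, n in enumerate(particle_counts):
    by_mass[n * particle_mass] = halo_id   # the two 40-particle haloes collide
    by_id[halo_id] = n * particle_mass     # unique keys, nothing lost

print(len(by_mass))   # 3 -- one halo silently dropped
print(len(by_id))     # 4 -- all haloes present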
participants (3):
- Geoffrey So
- Matthew Turk
- Stephen Skory