Hi all,

When I was trying to work on Mike's latest halo finder problem, I tried swapping out the total_mass sum for a .quantities call on this line:

https://bitbucket.org/yt_analysis/yt/src/73fa0ace10e3/yt/analysis_modules/ha...

but doing that seriously changed the answers. Digging a bit deeper, I see that for some reason the same value is being fed in twice (I'm using two cores) to mpi_allreduce, which gives the wrong total_mass out. With the original summing method, I see different values going into mpi_allreduce and the correct total_mass. This has led me to conclude that the .quantities call is not being done correctly here; I suspect something is wrong with the multi-level parallelism. Does anyone have any ideas of what could be going on?

Thanks!

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
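The two summation paths being compared would look roughly like this. This is a minimal sketch assuming the yt 2.x API; the dataset path and the all_data() source are stand-ins, not the halo finder's actual _data_source setup:

# A minimal sketch of the two summation paths being compared; the dataset
# path and data source setup are assumptions -- the halo finder's actual
# _data_source is constructed elsewhere. yt 2.x-style API.
from yt.mods import load

pf = load("RD0006/RD0006")
source = pf.h.all_data()

# Original method: a plain NumPy sum over the local field array, with the
# cross-processor reduction (mpi_allreduce) done explicitly elsewhere.
total_mass = source["ParticleMassMsun"].sum(dtype="float64")

# Attempted replacement: the built-in derived quantity, which performs its
# own parallel reduction internally.
total_mass_q = source.quantities["TotalQuantity"]("ParticleMassMsun")[0]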
Hi Stephen,

On Fri, Feb 24, 2012 at 7:25 PM, Stephen Skory <s@skory.us> wrote:
> Hi all,
>
> When I was trying to work on Mike's latest halo finder problem, I tried swapping out the total_mass sum for a .quantities call on this line:
>
> https://bitbucket.org/yt_analysis/yt/src/73fa0ace10e3/yt/analysis_modules/ha...
>
> but doing that seriously changed the answers. Digging a bit deeper, I see that for some reason the same value is being fed in twice (I'm using two cores) to mpi_allreduce, which gives the wrong total_mass out. With the original summing method, I see different values going into mpi_allreduce and the correct total_mass. This has led me to conclude that the .quantities call is not being done correctly here; I suspect something is wrong with the multi-level parallelism. Does anyone have any ideas of what could be going on? Thanks!
I'm not sure I understand -- is something wrong with *quantities-in-general*, or with the way it's being called in halo_objects? Can you check with a small dataset whether the answer changes between manually calculating and using quantities, in serial and in parallel?

-Matt
As a followup, this script (run on Mike's dataset):

http://paste.yt-project.org/show/2191/

gives results of:

5.793e+14 5.793e+14 5.640e+14 2.956e-14 1.336e-02

The accumulated roundoff error here is ~1%, which is something to be concerned about, and which I think we can start addressing by mandating dtypes in our calls to things like .sum(). But it's still not the significant errors that are being reported. I think quantities are okay. (Which is good, because the testing system would have caught any problems introduced with parallelism, and it flagged none.)

Additionally, I tested the HaloFinder in both parallel and serial, and on line 2240 I inserted a print statement indicating the received total volume. I got the correct answer in serial and in parallel.

Is this the issue you were seeing? Could you please provide a small, minimally viable sample script that can show the problem?

-Matt
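To illustrate the dtype point: a minimal, self-contained sketch (the array here is made up, not taken from the paste above):

import numpy as np

# Made-up particle masses stored at single precision, as they often are
# on disk.
masses = np.random.random(10**7).astype("float32")

# The default accumulator dtype matches the array, so roundoff accumulates
# over millions of additions.
print(masses.sum())
# Mandating a float64 accumulator keeps the reduction much closer to exact.
print(masses.sum(dtype="float64"))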
Hi Matt,
> Could you please provide a small, minimally viable sample script that can show the problem?
Here's what I'm seeing. With a diff like this (on my branch's tip):

http://paste.yt-project.org/show/2195/

using this script:

http://paste.yt-project.org/show/2196/

on the RD0006 dataset of the Enzo_64 yt-workshop collection, I get this output using two cores:

1.01953344682e+17 9.8838266396e+16
1.03009963462e+17 9.8838266396e+16

The first two values are different, as they should be with slightly different numbers of particles in each half, but the second two numbers are identical on each core, which I think is wrong. What do you think?

Thanks!

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
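A hypothetical reconstruction of the pattern being described, for readers without access to the pastes. This is not the actual paste contents; the subvolume construction and names are assumptions against the yt 2.x API:

# Hypothetical reconstruction of the pattern described above -- NOT the
# actual paste contents; the subvolume setup and names are assumptions.
from mpi4py import MPI
from yt.mods import load

pf = load("RD0006/RD0006")
rank = MPI.COMM_WORLD.rank

# Split the domain in half along z; each of the two ranks takes one half,
# standing in for the halo finder's per-processor self._data_source.
sub_region = pf.h.region([0.5, 0.5, 0.25 + 0.5 * rank],
                         [0.0, 0.0, 0.5 * rank],
                         [1.0, 1.0, 0.5 * (rank + 1)])

# Plain NumPy sum over the local subvolume: differs from rank to rank.
local_mass = sub_region["ParticleMassMsun"].sum(dtype="float64")

# Derived quantity on the same per-rank subvolume: its internal reduction
# spans *all* ranks, so both ranks report the same (wrong) number back.
quant_mass = sub_region.quantities["TotalQuantity"]("ParticleMassMsun")[0]

print(local_mass, quant_mass)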
Hi Stephen,

On Sat, Feb 25, 2012 at 6:52 PM, Stephen Skory <s@skory.us> wrote:
> Hi Matt,
>
>> Could you please provide a small, minimally viable sample script that can show the problem?
>
> Here's what I'm seeing. With a diff like this (on my branch's tip):
>
> http://paste.yt-project.org/show/2195/
>
> using this script:
>
> http://paste.yt-project.org/show/2196/
>
> on the RD0006 dataset of the Enzo_64 yt-workshop collection, I get this output using two cores:
>
> 1.01953344682e+17 9.8838266396e+16
> 1.03009963462e+17 9.8838266396e+16
>
> The first two values are different, as they should be with slightly different numbers of particles in each half, but the second two numbers are identical on each core, which I think is wrong. What do you think?
Oh, I see what you're doing. Originally you were actually trying to make your OWN quantity, instead of using yt builtins. So yeah, this is wrong. What you want to be doing is calling the quantity on the base, full-domain region. What it's actually doing is sub-decomposing each of the self._data_source objects and then communicating between the processors to compute the result; but neither processor gets the full set of components for its own data source. I'd recommend one of the following (the first two are sketched just below):

1) Call the total ParticleMassMsun quantity on the data source that covers all of the subdomains, pre-decomposition.
2) Do the local sum and then manually sum across processors, like you were doing.
3) Add a new processor group so the quantity's parallelism is confined to a single processor (yes, really), instead of letting it decompose across many.

-Matt
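A minimal sketch of options (1) and (2), assuming the yt 2.x API; the dataset path and the region setup are stand-ins, not the halo finder's actual internals:

# Minimal sketch of options (1) and (2); names and setup are stand-ins
# against the yt 2.x API, not the halo finder's actual internals.
from mpi4py import MPI
from yt.mods import load

pf = load("RD0006/RD0006")

# Option 1: evaluate the derived quantity on the full, pre-decomposition
# domain, so its internal reduction sees every particle exactly once.
full_domain = pf.h.all_data()
total_mass = full_domain.quantities["TotalQuantity"]("ParticleMassMsun")[0]

# Option 2: sum each rank's own subvolume with plain NumPy, then reduce
# across processors by hand.
rank = MPI.COMM_WORLD.rank
sub_region = pf.h.region([0.5, 0.5, 0.25 + 0.5 * rank],
                         [0.0, 0.0, 0.5 * rank],
                         [1.0, 1.0, 0.5 * (rank + 1)])
local_mass = sub_region["ParticleMassMsun"].sum(dtype="float64")
total_mass_manual = MPI.COMM_WORLD.allreduce(local_mass, op=MPI.SUM)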
Hi Matt,
> Oh, I see what you're doing. Originally you were actually trying to make your OWN quantity, instead of using yt builtins. So yeah, this is wrong.
Ok, this makes sense now. Thanks for clearing it up!

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)