Why do halo objects need their particles?
Hi all,

This week at the HIPACC Summer School, Chris Moody and I have been looking at finishing up Rockstar (rockstar.googlecode.com) integration. The current status is that Rockstar *works*, and Chris has added on TimeSeries support for it, but it alas only works for single-mass simulations. (The author of Rockstar has provided some suggestions about how to add multi-mass support.)

Rockstar does have an interesting method of calculating halo properties. Specifically, you add in variables and whatnot to a struct and then insert your own routines into the C code, recompile, rerun, and get additional results.

The way we've set up Rockstar integration, we could actually supply a generic Python function, and have it fill in a set of values that are then calculated on a halo. I think this is where we want to go with the Rockstar stuff -- you specify a set of functions you want calculated on a halo, these get calculated, you get back the halo object and the particles are flushed from memory.

What this brings up in my mind is, why do we hold on to the particles at all? When you run the halo finders in yt, the returned halo objects keep all the particles that belong to each halo. This lets the user specify, later on, which analysis to run. But it also leads to a *lot* of memory management, parallel distribution of objects (in a way that's different than we do now), and really I think doesn't add a *huge* amount of value. Nearly all the remaining analysis that gets performed either requires you to dump the particle IDs to disk anyway (merger trees) or could be performed for very little cost inline.

So if we think about this the way Christine, Stephen and everyone was talking about the other day, imagine you had a halo finder that was run like this:

    catalog = HaloFinder(ts, analysis = ["CenterOfMass", "MaximumRadii", "EllipsoidalParameters"])

The halo finder would go out, perform the analysis on each dataset in the time series, and then return to you the full tree. But where this would be different than before is that the halos would have the properties calculated ahead of time, and the particles themselves would be gone. They'd be far more manageable, and we could discard nearly *all* of the complexity in the existing halo objects.

You could write your halo analysis routines *nearly identically* to the way you'd write your field definitions:

    def MaximumRadii(halo, pf):
        dx = halo["x"] - halo.properties["CenterOfMass"]
        ...
        return max_r

The downside would be that to add on any analysis that requires *raw particle data*, you would either have to write out the particle IDs or information to disk, or re-run the halo finder. But honestly, I think this would probably be just fine to do. And the vast improvements to the infrastructure would probably be worth it. For instance, we would probably no longer need to subclass the halo objects, and we would be able to get rid of "claiming" on a processor-by-processor basis.

Thoughts? Stephen, what do you think?

-Matt
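To make the proposal above concrete, here is a minimal sketch of how "register analysis functions, compute them, then flush the particles" could look. Everything in it is an assumption for illustration: the `Halo` class, the `halo_analysis` decorator, `ANALYSIS_REGISTRY`, `analyze_halo`, and the `"mass"` field name are hypothetical and are not yt or Rockstar API. It uses plain Python lists rather than arrays, and `pf` is simply passed through unused.

```python
import math

# Hypothetical registry of analysis functions, keyed by function name.
ANALYSIS_REGISTRY = {}


def halo_analysis(func):
    """Register a halo analysis function under its own name."""
    ANALYSIS_REGISTRY[func.__name__] = func
    return func


class Halo:
    """Illustrative halo: holds particles only until analysis is done."""

    def __init__(self, particles):
        # particles: dict mapping field name -> list of floats
        self._particles = particles
        self.properties = {}

    def __getitem__(self, field):
        if self._particles is None:
            raise RuntimeError("particles have been flushed")
        return self._particles[field]

    def flush(self):
        # Drop the particle data; only computed properties remain.
        self._particles = None


@halo_analysis
def CenterOfMass(halo, pf):
    mass = halo["mass"]
    total = sum(mass)
    return tuple(
        sum(m * x for m, x in zip(mass, halo[ax])) / total
        for ax in ("x", "y", "z")
    )


@halo_analysis
def MaximumRadii(halo, pf):
    # Relies on CenterOfMass having been computed earlier in the list.
    cx, cy, cz = halo.properties["CenterOfMass"]
    return max(
        math.sqrt((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2)
        for x, y, z in zip(halo["x"], halo["y"], halo["z"])
    )


def analyze_halo(halo, pf, analysis):
    """Run the requested analyses in order, then flush the particles."""
    for name in analysis:
        halo.properties[name] = ANALYSIS_REGISTRY[name](halo, pf)
    halo.flush()
    return halo


if __name__ == "__main__":
    halo = Halo({"x": [0.0, 2.0], "y": [0.0, 0.0],
                 "z": [0.0, 0.0], "mass": [1.0, 1.0]})
    analyze_halo(halo, pf=None, analysis=["CenterOfMass", "MaximumRadii"])
    print(halo.properties)
```

In this sketch the analyses run in the order listed, so a function that reads `halo.properties` has to come after the property it depends on. Once `analyze_halo` returns, any further access to raw particle fields raises an error, which mirrors the downside described above.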
Hi Matt,
I think the way you've structured this conversation is right. There are
halo properties that must be calculated from the particles, and ones that
can be derived from the list of halo properties you've already calculated.
It's similar to native vs derived fields for the frontends in that a
different halo finder will return different 'native' properties (where
native is defined as a first halo finding run with particles) and that
later runs will be cheap if they can just depend on precalculated fields.
That suggests that when we define halo properties we should make explicit
what depends on the particles directly (hard), and what can just be derived
from that (easy).
chris
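One possible way to make the particle-dependent vs. derived distinction explicit is to have each property declare whether it needs raw particles and which other properties it depends on. The sketch below is only an illustration under that assumption: the `PROPERTIES` table, the `needs_particles` and `depends_on` keys, the `Diameter` property, and both helper functions are hypothetical names, not anything in yt.

```python
# Hypothetical property declarations. "Diameter" is an invented example of
# a property derivable from already-computed values alone.
PROPERTIES = {
    "CenterOfMass": {"needs_particles": True, "depends_on": []},
    "MaximumRadii": {"needs_particles": True, "depends_on": ["CenterOfMass"]},
    "Diameter": {"needs_particles": False, "depends_on": ["MaximumRadii"]},
}


def resolve_order(requested, registry, cached=()):
    """Return the properties still to compute, dependencies first.

    Properties named in `cached` are treated as already available, so
    neither they nor their dependencies are scheduled again.
    """
    ordered, seen = [], set()

    def visit(name, stack=()):
        if name in seen or name in cached:
            return
        if name in stack:
            raise ValueError("circular dependency at %s" % name)
        for dep in registry[name]["depends_on"]:
            visit(dep, stack + (name,))
        seen.add(name)
        ordered.append(name)

    for name in requested:
        visit(name)
    return ordered


def needs_particle_pass(requested, registry, cached=()):
    """True if any property still to compute needs raw particle data."""
    return any(
        registry[name]["needs_particles"]
        for name in resolve_order(requested, registry, cached)
    )


if __name__ == "__main__":
    # First run: nothing cached, so the halo finder must touch particles.
    print(resolve_order(["Diameter"], PROPERTIES))
    print(needs_particle_pass(["Diameter"], PROPERTIES))
    # Later run: MaximumRadii is already stored, so this is the cheap case.
    print(needs_particle_pass(["Diameter"], PROPERTIES, cached={"MaximumRadii"}))
```

With declarations like these, a later run can tell up front whether it is in the "hard" case (a halo finding pass with particles) or the "easy" case (everything follows from stored properties).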
On Thu, Jul 12, 2012 at 9:15 AM, Matthew Turk wrote:
> [snip]
> _______________________________________________
> yt-dev mailing list
> yt-dev@lists.spacepope.org
> http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
Hi Chris & Matt,
> Thoughts? Stephen, what do you think?
This is an interesting way to go, and I think it has the potential to be a useful replacement. This is kind of like taking the way the two-point functions work and applying it to halos. There are some details on how this might work that we would need to explore. For example, although we wouldn't need halos to be 'claimed' by processors, we would need to think about how to parallelize these halo analysis functions when halo data is on more than one processor.

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
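As one illustration of the parallelization question, some properties can be written as a per-processor partial result plus a cheap combine step. The sketch below does this for the center of mass, assuming each processor holds a chunk of one halo's particles. The function names are hypothetical, and the "processors" are just two lists in a single process; in a real parallel run the combine step would be a reduction across processors, which is not shown here.

```python
def partial_moments(masses, positions):
    """Per-processor partial sums: total mass and mass-weighted position."""
    m_tot = sum(masses)
    weighted = [
        sum(m * p[i] for m, p in zip(masses, positions)) for i in range(3)
    ]
    return m_tot, weighted


def combine_center_of_mass(partials):
    """Combine partial sums from every processor into one center of mass."""
    m_tot = sum(p[0] for p in partials)
    return tuple(sum(p[1][i] for p in partials) / m_tot for i in range(3))


if __name__ == "__main__":
    # Two chunks of the same halo, as if they lived on different processors.
    chunk_a = partial_moments([1.0, 1.0], [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)])
    chunk_b = partial_moments([2.0], [(4.0, 0.0, 0.0)])
    print(combine_center_of_mass([chunk_a, chunk_b]))
```

Properties that need a previously combined value (a maximum radius about the center of mass, for instance) would take a second pass of the same shape: share the combined center, compute a local maximum on each processor, then take the maximum of those. Properties that do not decompose this way would need a different strategy.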
participants (3)
- Christopher Moody
- Matthew Turk
- Stephen Skory