I think the way you've structured this conversation is right. There are halo properties that must be calculated from the particles, and ones that can be derived from halo properties you've already calculated. It's similar to native vs. derived fields in the frontends: a different halo finder will return different 'native' properties (where 'native' means what comes out of the first halo-finding pass over the particles), and later runs will be cheap if they only depend on precalculated fields.
That suggests that when we define halo properties we should make explicit what depends on the particles directly (hard) and what can just be derived from those (easy).
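For example, something like this sketch is what I have in mind -- none of these names exist in yt today, they're just placeholders to show the hard/easy distinction being recorded at definition time:

    # Hypothetical sketch, not existing yt code: each halo property declares
    # whether it needs raw particles ("hard"/native) or only other,
    # already-computed properties ("easy"/derived).
    import numpy as np

    halo_properties = {}

    def halo_property(name, requires_particles=False, depends_on=()):
        def register(func):
            halo_properties[name] = {"func": func,
                                     "requires_particles": requires_particles,
                                     "depends_on": list(depends_on)}
            return func
        return register

    @halo_property("CenterOfMass", requires_particles=True)
    def center_of_mass(halo):
        # "Hard": needs the raw particle arrays (field names are assumed).
        m = halo["particle_mass"]
        pos = halo["positions"]
        return (m[:, None] * pos).sum(axis=0) / m.sum()

    @halo_property("VirialVelocity", depends_on=("VirialMass", "VirialRadius"))
    def virial_velocity(halo):
        # "Easy": derivable purely from precalculated properties.
        G = 6.674e-8  # cgs
        return np.sqrt(G * halo.properties["VirialMass"] /
                       halo.properties["VirialRadius"])

A second halo-finding pass (or a later analysis run) could then skip anything with requires_particles=False and just evaluate it from the catalog.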
On Thu, Jul 12, 2012 at 9:15 AM, Matthew Turk firstname.lastname@example.org wrote:
This week at the HIPACC Summer School, Chris Moody and I have been looking at finishing up Rockstar (rockstar.googlecode.com) integration. The current status is that Rockstar *works*, and Chris has added TimeSeries support for it, but alas it only works for single-mass simulations. (The author of Rockstar has provided some suggestions about how to add multi-mass support.)
Rockstar does have an interesting method of calculating halo properties. Specifically, you add in variables and whatnot to a struct and then insert your own routines into the C code, recompile, rerun, and get additional results.
The way we've set up Rockstar integration, we could actually supply a generic Python function, and have it fill in a set of values that are then calculated on a halo. I think this is where we want to go with the Rockstar stuff -- you specify a set of functions you want calculated on a halo, these get calculated, you get back the halo object and the particles are flushed from memory.
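Roughly, I'm picturing something like this -- the callback keyword and the shape of the per-halo particle data are made up for illustration, not the current interface:

    # Sketch only; "analysis_callbacks" and the structure of "particles"
    # are hypothetical, not the existing RockstarHaloFinder API.
    import numpy as np

    def velocity_dispersion(particles):
        # Receives the particle arrays for a single halo, returns one scalar.
        dv = particles["velocities"] - particles["velocities"].mean(axis=0)
        return np.sqrt((dv ** 2).sum(axis=1).mean())

    rh = RockstarHaloFinder(ts,
             analysis_callbacks={"VelocityDispersion": velocity_dispersion})
    # Each callback is evaluated per halo, the scalar is stored on the halo
    # object, and the particle arrays are then flushed from memory.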
What this brings up in my mind is, why do we hold on to the particles at all? When you run the halo finders in yt, the returned halo objects keep all the particles that belong to each halo. This lets the user specify, later on, which analysis to run. But it also leads to a *lot* of memory management and parallel distribution of objects (in a way that's different from how we do it now), and I really don't think it adds a *huge* amount of value. Nearly all the remaining analysis that gets performed either requires you to dump the particle IDs to disk anyway (merger trees) or could be performed inline for very little cost.
So if we think about this the way Christine, Stephen, and everyone else were talking about it the other day, imagine you had a halo finder that was run like this:
catalog = HaloFinder(ts, analysis = ["CenterOfMass", "MaximumRadii", "EllipsoidalParameters"])
The halo finder would go out, perform the analysis on each dataset in the time series, and then return to you the full tree. But where this would be different than before is that the halos would have the properties calculated ahead of time, and the particles themselves would be gone. They'd be far more manageable, and we could discard nearly *all* of the complexity in the existing halo objects.
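Downstream analysis on the returned catalog could then look something like this (the catalog/halo interface shown here is just illustrative, not a settled API):

    # Illustrative only -- the iteration pattern and property names are a guess.
    for halos in catalog:          # one halo list per dataset in the series
        for halo in halos:
            com = halo.properties["CenterOfMass"]
            r_max = halo.properties["MaximumRadii"]
            # No particle arrays on the halo at this point; everything we
            # asked for was computed inline and the particles were dropped.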
You could write your halo analysis routines *nearly identically* to the way you'd write your field definitions:
def MaximumRadii(halo, pf):
    dx = halo["x"] - halo.properties["CenterOfMass"]
    ...
    return max_r
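Filled out, that sketch might look roughly like this (the particle field names are assumptions, not a fixed interface):

    import numpy as np

    def MaximumRadii(halo, pf):
        # Uses a property computed earlier plus the raw particle positions,
        # much the way a derived field uses other fields.
        com = halo.properties["CenterOfMass"]
        dx = halo["particle_position_x"] - com[0]
        dy = halo["particle_position_y"] - com[1]
        dz = halo["particle_position_z"] - com[2]
        max_r = np.sqrt(dx ** 2 + dy ** 2 + dz ** 2).max()
        return max_r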
The downside would be that to add on any analysis that requires *raw particle data*, you would either have to write out the particle IDs or other information to disk, or re-run the halo finder. But honestly, I think this would probably be just fine. And the vast improvements to the infrastructure would probably be worth it. For instance, we would probably no longer need to subclass the halo objects, and we would be able to get rid of "claiming" halos on a processor-by-processor basis.
Thoughts? Stephen, what do you think?
-Matt