Hi Matt,

I think the way you've structured this conversation is right. There are halo properties that must be calculated from the particles, and ones that can be derived from the list halo properties you've already calculated. It's similar to native vs derived fields for the frontends in that a different halo finder will return different 'native' properties (where native is defined as a first halo finding run with particles) and that later runs will be cheap if they can just depend on precalculated fields. 

That suggests that when we define halo properties we should make explicit what depends on the particles directly (hard), and what can just derived from of that (easy).


On Thu, Jul 12, 2012 at 9:15 AM, Matthew Turk <matthewturk@gmail.com> wrote:
Hi all,

This week at the HIPACC Summer School, Chris Moody and I have been
looking at finishing up Rockstar (rockstar.googlecode.com)
integration.  The current status is that Rockstar *works*, and Chris
has added on TimeSeries support for it, but it alas only works for
single-mass simulations.  (The author of Rockstar has provided some
suggestions about how to add multi-mass support.)

Rockstar does have an interesting method of calculating halo
properties.  Specifically, you add in variables and whatnot to a
struct and then insert your own routines into the C code, recompile,
rerun, and get additional results.

The way we've set up Rockstar integration, we could actually supply a
generic Python function, and have it fill in a set of values that are
then calculated on a halo.  I think this is where we want to go with
the Rockstar stuff -- you specify a set of functions you want
calculated on a halo, these get calculated, you get back the halo
object and the particles are flushed from memory.

What this brings up in my mind is, why do we hold on to the particles
at all?  When you run the halo finders in yt, the returned halo
objects keep all the particles that belong to each halo.  This lets
the user specify, later on, which analysis to run.  But, it also leads
to a *lot* of memory management, parallel-distribution of objects (in
a way that's different than we do now), and really I think doesn't add
a *huge* amount of value.  Nearly all the remaining analysis that gets
performed either requires you to dump the particle IDs to disk anyway
(merger trees) or could be performed for very little cost inline.

So if we think about this the way Christine, Stephen and everyone was
talking about the other day, imagine you had a halo finder that was
run like this:

catalog = HaloFinder(ts, analysis = ["CenterOfMass", "MaximumRadii",

The halo finder would go out, perform the analysis on each dataset in
the time series, and then return to you the full tree.  But where this
would be different than before is that the halos would have the
properties calculated ahead of time, and the particles themselves
would be gone.  They'd be far more manageable, and we could discard
nearly *all* of the complexity in the existing halo objects.

You could write your halo analysis routines *nearly identically* to
the way you'd write your field definitions:

def MaximumRadii(halo, pf):
    dx = halo["x"] - halo.properties["CenterOfMass"]
    return max_r

The downside would be that to add on any analysis that requires *raw
particle data*, you would either have to write out the particle IDs or
information to disk, or re-run the halo finder.  But honestly, I think
this would probably be just fine to do.  And the vast improvements to
the infrstructure would probably be worth it.  For instance, we would
probably no longer need to subclass the halo objects, and we would be
able to get rid of "claiming" on a processor-by-processor basis.

Thoughts?  Stephen, what do you think?

yt-dev mailing list