[yt-dev] RFC: (yt-3.0) Geometry handling redesign

24 Feb 2012

      Hi all,

This email is going to be long, but the changes I am proposing in it
will require feedback.  Next Monday at 1PM Eastern I'll be holding a
hangout on Google+, if you would like to join in and discuss it with
lower latency.  (Of course, if nobody shows up, that's okay too.)  We
don't really have a detailed review process for design decisions like
this, and I hope that this discussion will guide developing that kind
of review process in the future.

Everything described here is in the branch geometry_handling at
https://bitbucket.org/MatthewTurk/yt-refactor

= Background =

Geometry in yt has always been designed around two things:

1) Grid patches
2) Spatial location based on bounding boxes

This has presented a number of excellent opportunities for optimizing
patch-based AMR data.  However, it has posed challenges for supporting
particle-based datasets.  (Particles inside patch-based data are
somewhat easier, although if you want to see one of the worst pieces
of yt, look no further than FakeGridForParticles, which mocks up
particles as grids with varying x,y,z, so that existing
fluid-selection routines can be used.)  Furthermore, the state of
Octree codes is not good.  For relatively small (up to about 64)
domains in RAMSES, we can do analysis with relative ease -- volume
rendering, projections, fluid analysis, etc, all work very nicely.
However, what is required is a costly step of regridding octs at the
outset, to form grid patches with some efficiency.  This regridding is
not stored between runs, and it's not *efficient*.

This has worked so far, but it is showing its strain.  As it stands,
yt cannot directly support SPH codes without somehow making SPH look
like grids.  Octree codes can't be supported without making them look
like grid codes.  Both of these are very slow operations.  And on top
of that, the current state of how data is read in, even for grid
codes, is a bit gross.  There are a number of conditionals, mechanisms
for striding over data (i.e., "_get_grids()" which is called in
parallel, and iterates over grids, and sometimes you have to call
_get_point_indices and sometimes you don't), and probably worst of all
a lot of unnecessary memory duplication because arrays in
regions/spheres/etc are constructed by iterating over grids, filling
in a list of arrays, and then concatenating that list into a second
array.

Furthermore, several times recently the question has come up of
handling and supporting more generic geometries: spherical, polar,
cylindrical.  Unfortunately, because we're so heavily based around the
idea of grids as is, these are difficult to support.  All of our
collision detection has been designed with rectilinear surfaces in
mind.  That being said, I'm really quite happy with how yt *functions*
for patch-based grids.  I'm just not sure that as it stands it is
maintainable, and I think that addressing the geometry issues could
ultimately also help out a great deal with scalability.  (More on that
below.)

= Geometry Overhaul =

For the last couple weeks, I have been working on an overhaul of the
geometry system in yt, attempting to handle parallel striding in a
format-agnostic way, deal with generating fields in as simple a way as
possible, and also attempting to speed things up.  I'm going to
summarize below the changes that I've made, the new design, and the
current status.

== Design: Selectors ==

As it stands, yt data objects (such as Region, Sphere, etc) are mostly
thought of as conduits for data.  They apply a criteria to the domain
and select all of the fluid points or particles within a region or
that fit a specific non-spatial criteria.  I have extended this
concept to construct very lightweight objects called selectors,
similar to how ParticleIO is now conducted for Enzo.  These are
objects in Cython that accept either a set of bounding-box arrays or a
set of spatial points and decide if those points are included in a
given region.  This is essentially a combination of the
get_point_indices and get_cut_mask methods that currently exist.
These have all been designed to be GIL-free, so that in the future
they can be threaded using OpenMP (although it turns out not very much
time is spent on them.)  To date I have implemented spheres, regions,
disks, cutting planes, slices, ortho rays, and grid collections.  They
all give bitwise identical results to the old method and are almost
all substantially faster.  (Cutting planes are particularly better,
regions are slightly slower, and the others are all about 25-40%
faster.)

One primary advantage is that these can (and do) trivially provide
counting methods, masking methods, and point-evaluation methods.  So
we can use the same routines to select particles and individual points
(for collision detection) that we use for selecting grid cells.  This
will allow reuse of data object code with particle outputs, and is a
primary advantage of the refactor.

This also led to separating selection-based data containers from
constructed data containers.

== Design: Ditching the Hierarchy ==

(Hopefully that heading caught some eyes.)  In line with attempting to
reduce Enzo-isms in the code, I have been moving away from
specifically calling something a "hierarchy," and instead moving
toward referring to geometry handlers.  grids, grid_left_edges,
grid_right_edges, and so on will no longer be guaranteed to appear
hanging off these objects.  The new class structure:

GeometryHandler
GridGeometryHandler(GeometryHandler)
OctreeGeometryHandler(GeometryHandler)

and eventually, other methods.  The concept behind this is to isolate
those items that *must* require domain knowledge, mesh knowledge, or
particle/fluid distribution knowledge, and place those in the
GeometryHandler.  All items specific to patch based or block based AMR
will go in GridGeometryHandler and so on.  This will allow for
fundamentally different method of IO for octrees from grids, for
instance.  The geometry handlers are all required to implement the
following methods:

_read_selection(self, fields, data_object, chunk)
_chunk(self, data_object, chunking_style, ngz = 0)

I will describe chunks later.  Note that the first routine,
_read_selection, accepts the names of the fields, and the data object
that describes how to select which portions of a dataset participate
in a given selection.  For the specific case of Enzo (as I am still in
design phase, I have only been working with Enzo data, although I will
move on to FLASH and Boxlib next) read_selection calls a routine that
counts and masks grids, batches operations to files, and then returns
only the necessary data for a selection.  The data object does no
discarding of out-of-bounds or child-masked data.

== Design: Chunking ==

Currently, yt uses de facto chunking of data in the form of grids.
When Derived Quantities want to handle individual grids, one at a
time, they "preload" the data from whatever grids the
ParallelAnalysisInterface thinks they deserve.  These grids are
iterated over, and handled individually, then the result is combined
at the end.  Profiles do something similar.  However, both of these
are de facto, and not really designed.  They rely on calls to
semi-private functions on data objects, manually masking data, on and
on.

To get around this ugly spot, I have started working on redesigning IO
and data handling to function in "chunks."  In this method, data
objects will request that the geometry handler supply a set of chunks
(the function _chunk, described above, conducts this task.)  Chunks
are of the form (IO_unit, size), where IO_unit is only ever managed or
handled by _read_selection, above.

To start the Chunks shuffling over the output, the code calls
data_source.chunks(fields, chunking_style).  Right now only "spatial",
"io" and "all" are supported for chunking styles.  This corresponds to
spatially-oriented division, IO-conserving, and all-at-once (not
usually relevant.)  What happens next is kind of fun.  The chunks
function looks like this:

    def chunks(self, fields, chunking_style, **kwargs):
        for chunk in self.hierarchy._chunk(self, chunking_style, **kwargs):
            with self._chunked_read(chunk):
                self.get_data(fields)
                yield self

Note what it does here -- it actually yields *itself*.  However,
inside the chunked_read function, what happens is that the attributes
corresponding to the size, the current data source, and so on, are set
by the geometry handler (still called a hierarchy here.)  So, for
instance, execution might look like this:

for ds in my_obj.chunks(["Density"], "spatial"):
  print ds is my_obj
  print ds["Density"].size

The first line will actually print True, but the results from the
second one will be the size of (for instance) the grid it's currently
iterating over.  In this way, it becomes much easier to stride over
subsets of data.  Derived quantities now look like this:

        chunks = self._data_source.chunks([], chunking_style="io")
        for ds in parallel_objects(chunks, -1):
            rv = self.func(ds, *args, **kwargs)

It chunks things, evaluates and then stores intermediate results.

This is not meant to replace spatial decomposition in parallel jobs,
but it *is* designed to enable much easier and *mesh-neutral* division
of labor for parallelism and for IO.  If we were to call chunk on an
octree, it no longer has to make things look like grids; it just makes
them look like flattened arrays (unless you chunk over spatial, which
I haven't gotten into yet.)

Essentially, by making the method of subsetting and striding over
subsetted data more compartmentalized, the code becomes more clear and
more maintainable.

= Status =

As it stands, I am able to select data very nicely -- this means
slices, profiles, derived quantities all work.  However (see below)
particles still need some work and handling, and constructed data
objects (smoothed covering grids, projections) have yet to be ported.

= Open Questions =

However, the main reason I am writing -- when the code is not yet
fully functional! -- is to ask for advice and consideration on a few
points.

== Particle Types ==

In the past, we've talked about handling particle types differently
depending on the particle.  Because this is destined for yt-3.0, I
would like to propose that we break compatibility with old scripts in
favor of clarity in our particle types.  Specifically, rather than
doing:

obj["particle_position_x"][obj["particle_type"] == 1]

We do,

obj[("particle_positions_x", 1)]

(or obj[("particle_positions_x", "PopIICluster")] if you are using a
code that supports such expression.)

And if a field is found to be a particle type field, but without a
tuple, we simply raise a KeyError.  We can emulate old behavior with:

obj[("particle_position_x",None)]

But I am inclined to say that I would rather *mandate* particle type
selection, rather than merely allow it.  This will lead to much
greater clarity, I think.

[+-][01] on

== Particle Access ==

Currently, we access particles off a data_object:

sphere["particle_position_x"]

Do we want to keep this, or move to being much more explicit, that we
require sphere.particles["particle_position_x"] ?  I can see both
sides to this.  Explicit is good, and would allow us to be both more
clear with users and about fields (and with our IO methods) but
allowing old behavior encourages not having to necessarily worry about
if something is a fluid or a particle, which is a good thing I think.

== User Survey ==

Should we conduct a brief (only a handful of questions) polling users
for their feelings on the changes or potential changes to
nomenclature, particles, and so on?  We could use it also as a time to
identify other outstanding issues with yt.

= Conclusion =

Does this all make a sort of sense?

[+-][01] on the design presented so far

If you're interested in helping out, please get in touch.  I'm still
laying groundwork in some areas, but in other areas I would appreciate
help and advice moving forward.

Thanks to everyone who made it this far,

Matt

[yt-dev] RFC: (yt-3.0) Geometry handling redesign

Matthew Turk