This email is going to be long, but the changes I am proposing in it will require feedback. Next Monday at 1PM Eastern I'll be holding a hangout on Google+, if you would like to join in and discuss it with lower latency. (Of course, if nobody shows up, that's okay too.) We don't really have a detailed review process for design decisions like this, and I hope that this discussion will guide developing that kind of review process in the future.
Everything described here is in the branch geometry_handling at https://bitbucket.org/MatthewTurk/yt-refactor
= Background =
Geometry in yt has always been designed around two things:
1) Grid patches 2) Spatial location based on bounding boxes
This has presented a number of excellent opportunities for optimizing patch-based AMR data. However, it has posed challenges for supporting particle-based datasets. (Particles inside patch-based data are somewhat easier, although if you want to see one of the worst pieces of yt, look no further than FakeGridForParticles, which mocks up particles as grids with varying x,y,z, so that existing fluid-selection routines can be used.) Furthermore, the state of Octree codes is not good. For relatively small (up to about 64) domains in RAMSES, we can do analysis with relative ease -- volume rendering, projections, fluid analysis, etc, all work very nicely. However, what is required is a costly step of regridding octs at the outset, to form grid patches with some efficiency. This regridding is not stored between runs, and it's not efficient.
This has worked so far, but it is showing its strain. As it stands, yt cannot directly support SPH codes without somehow making SPH look like grids. Octree codes can't be supported without making them look like grid codes. Both of these are very slow operations. And on top of that, the current state of how data is read in, even for grid codes, is a bit gross. There are a number of conditionals, mechanisms for striding over data (i.e., "_get_grids()" which is called in parallel, and iterates over grids, and sometimes you have to call _get_point_indices and sometimes you don't), and probably worst of all a lot of unnecessary memory duplication because arrays in regions/spheres/etc are constructed by iterating over grids, filling in a list of arrays, and then concatenating that list into a second array.
Furthermore, several times recently the question has come up of handling and supporting more generic geometries: spherical, polar, cylindrical. Unfortunately, because we're so heavily based around the idea of grids as is, these are difficult to support. All of our collision detection has been designed with rectilinear surfaces in mind. That being said, I'm really quite happy with how yt functions for patch-based grids. I'm just not sure that as it stands it is maintainable, and I think that addressing the geometry issues could ultimately also help out a great deal with scalability. (More on that below.)
= Geometry Overhaul =
For the last couple weeks, I have been working on an overhaul of the geometry system in yt, attempting to handle parallel striding in a format-agnostic way, deal with generating fields in as simple a way as possible, and also attempting to speed things up. I'm going to summarize below the changes that I've made, the new design, and the current status.
== Design: Selectors ==
As it stands, yt data objects (such as Region, Sphere, etc) are mostly thought of as conduits for data. They apply a criteria to the domain and select all of the fluid points or particles within a region or that fit a specific non-spatial criteria. I have extended this concept to construct very lightweight objects called selectors, similar to how ParticleIO is now conducted for Enzo. These are objects in Cython that accept either a set of bounding-box arrays or a set of spatial points and decide if those points are included in a given region. This is essentially a combination of the get_point_indices and get_cut_mask methods that currently exist. These have all been designed to be GIL-free, so that in the future they can be threaded using OpenMP (although it turns out not very much time is spent on them.) To date I have implemented spheres, regions, disks, cutting planes, slices, ortho rays, and grid collections. They all give bitwise identical results to the old method and are almost all substantially faster. (Cutting planes are particularly better, regions are slightly slower, and the others are all about 25-40% faster.)
One primary advantage is that these can (and do) trivially provide counting methods, masking methods, and point-evaluation methods. So we can use the same routines to select particles and individual points (for collision detection) that we use for selecting grid cells. This will allow reuse of data object code with particle outputs, and is a primary advantage of the refactor.
This also led to separating selection-based data containers from constructed data containers.
== Design: Ditching the Hierarchy ==
(Hopefully that heading caught some eyes.) In line with attempting to reduce Enzo-isms in the code, I have been moving away from specifically calling something a "hierarchy," and instead moving toward referring to geometry handlers. grids, grid_left_edges, grid_right_edges, and so on will no longer be guaranteed to appear hanging off these objects. The new class structure:
GeometryHandler GridGeometryHandler(GeometryHandler) OctreeGeometryHandler(GeometryHandler)
and eventually, other methods. The concept behind this is to isolate those items that must require domain knowledge, mesh knowledge, or particle/fluid distribution knowledge, and place those in the GeometryHandler. All items specific to patch based or block based AMR will go in GridGeometryHandler and so on. This will allow for fundamentally different method of IO for octrees from grids, for instance. The geometry handlers are all required to implement the following methods:
_read_selection(self, fields, data_object, chunk) _chunk(self, data_object, chunking_style, ngz = 0)
I will describe chunks later. Note that the first routine, _read_selection, accepts the names of the fields, and the data object that describes how to select which portions of a dataset participate in a given selection. For the specific case of Enzo (as I am still in design phase, I have only been working with Enzo data, although I will move on to FLASH and Boxlib next) read_selection calls a routine that counts and masks grids, batches operations to files, and then returns only the necessary data for a selection. The data object does no discarding of out-of-bounds or child-masked data.
== Design: Chunking ==
Currently, yt uses de facto chunking of data in the form of grids. When Derived Quantities want to handle individual grids, one at a time, they "preload" the data from whatever grids the ParallelAnalysisInterface thinks they deserve. These grids are iterated over, and handled individually, then the result is combined at the end. Profiles do something similar. However, both of these are de facto, and not really designed. They rely on calls to semi-private functions on data objects, manually masking data, on and on.
To get around this ugly spot, I have started working on redesigning IO and data handling to function in "chunks." In this method, data objects will request that the geometry handler supply a set of chunks (the function _chunk, described above, conducts this task.) Chunks are of the form (IO_unit, size), where IO_unit is only ever managed or handled by _read_selection, above.
To start the Chunks shuffling over the output, the code calls data_source.chunks(fields, chunking_style). Right now only "spatial", "io" and "all" are supported for chunking styles. This corresponds to spatially-oriented division, IO-conserving, and all-at-once (not usually relevant.) What happens next is kind of fun. The chunks function looks like this:
def chunks(self, fields, chunking_style, **kwargs): for chunk in self.hierarchy._chunk(self, chunking_style, **kwargs): with self._chunked_read(chunk): self.get_data(fields) yield self
Note what it does here -- it actually yields itself. However, inside the chunked_read function, what happens is that the attributes corresponding to the size, the current data source, and so on, are set by the geometry handler (still called a hierarchy here.) So, for instance, execution might look like this:
for ds in my_obj.chunks(["Density"], "spatial"): print ds is my_obj print ds["Density"].size
The first line will actually print True, but the results from the second one will be the size of (for instance) the grid it's currently iterating over. In this way, it becomes much easier to stride over subsets of data. Derived quantities now look like this:
chunks = self._data_source.chunks(, chunking_style="io") for ds in parallel_objects(chunks, -1): rv = self.func(ds, *args, **kwargs)
It chunks things, evaluates and then stores intermediate results.
This is not meant to replace spatial decomposition in parallel jobs, but it is designed to enable much easier and mesh-neutral division of labor for parallelism and for IO. If we were to call chunk on an octree, it no longer has to make things look like grids; it just makes them look like flattened arrays (unless you chunk over spatial, which I haven't gotten into yet.)
Essentially, by making the method of subsetting and striding over subsetted data more compartmentalized, the code becomes more clear and more maintainable.
= Status =
As it stands, I am able to select data very nicely -- this means slices, profiles, derived quantities all work. However (see below) particles still need some work and handling, and constructed data objects (smoothed covering grids, projections) have yet to be ported.
= Open Questions =
However, the main reason I am writing -- when the code is not yet fully functional! -- is to ask for advice and consideration on a few points.
== Particle Types ==
In the past, we've talked about handling particle types differently depending on the particle. Because this is destined for yt-3.0, I would like to propose that we break compatibility with old scripts in favor of clarity in our particle types. Specifically, rather than doing:
obj["particle_position_x"][obj["particle_type"] == 1]
(or obj[("particle_positions_x", "PopIICluster")] if you are using a code that supports such expression.)
And if a field is found to be a particle type field, but without a tuple, we simply raise a KeyError. We can emulate old behavior with:
But I am inclined to say that I would rather mandate particle type selection, rather than merely allow it. This will lead to much greater clarity, I think.
== Particle Access ==
Currently, we access particles off a data_object:
Do we want to keep this, or move to being much more explicit, that we require sphere.particles["particle_position_x"] ? I can see both sides to this. Explicit is good, and would allow us to be both more clear with users and about fields (and with our IO methods) but allowing old behavior encourages not having to necessarily worry about if something is a fluid or a particle, which is a good thing I think.
== User Survey ==
Should we conduct a brief (only a handful of questions) polling users for their feelings on the changes or potential changes to nomenclature, particles, and so on? We could use it also as a time to identify other outstanding issues with yt.
= Conclusion =
Does this all make a sort of sense?
[+-] on the design presented so far
If you're interested in helping out, please get in touch. I'm still laying groundwork in some areas, but in other areas I would appreciate help and advice moving forward.
Thanks to everyone who made it this far,