RFC: (yt-3.0) Geometry handling redesign
Hi all,
This email is going to be long, but the changes I am proposing in it will require feedback. Next Monday at 1PM Eastern I'll be holding a hangout on Google+, if you would like to join in and discuss it with lower latency. (Of course, if nobody shows up, that's okay too.) We don't really have a detailed review process for design decisions like this, and I hope that this discussion will guide developing that kind of review process in the future.
Everything described here is in the branch geometry_handling at https://bitbucket.org/MatthewTurk/yt-refactor
= Background =
Geometry in yt has always been designed around two things:
1) Grid patches
2) Spatial location based on bounding boxes
This has presented a number of excellent opportunities for optimizing patch-based AMR data. However, it has posed challenges for supporting particle-based datasets. (Particles inside patch-based data are somewhat easier, although if you want to see one of the worst pieces of yt, look no further than FakeGridForParticles, which mocks up particles as grids with varying x,y,z, so that existing fluid-selection routines can be used.) Furthermore, the state of Octree codes is not good. For relatively small (up to about 64) domains in RAMSES, we can do analysis with relative ease -- volume rendering, projections, fluid analysis, etc, all work very nicely. However, what is required is a costly step of regridding octs at the outset, to form grid patches with some efficiency. This regridding is not stored between runs, and it's not *efficient*.
This has worked so far, but it is showing its strain. As it stands, yt cannot directly support SPH codes without somehow making SPH look like grids. Octree codes can't be supported without making them look like grid codes. Both of these are very slow operations. And on top of that, the current state of how data is read in, even for grid codes, is a bit gross. There are a number of conditionals, mechanisms for striding over data (i.e., "_get_grids()" which is called in parallel, and iterates over grids, and sometimes you have to call _get_point_indices and sometimes you don't), and probably worst of all a lot of unnecessary memory duplication because arrays in regions/spheres/etc are constructed by iterating over grids, filling in a list of arrays, and then concatenating that list into a second array.
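To make the memory duplication concrete, here is a sketch of the gather-then-concatenate pattern next to the count-first pattern that the selector redesign enables. The array sizes and the selection condition here are made up for illustration:

```python
import numpy as np

# Hypothetical stand-ins for per-grid data in a region selection.
rng = np.random.RandomState(0)
per_grid_data = [rng.random_sample(1000) for _ in range(64)]

# Current pattern: fill a list of per-grid arrays, then concatenate.
# While np.concatenate runs, the pieces and the combined array are
# all alive at once, roughly doubling peak memory for the selection.
pieces = [arr[arr > 0.5] for arr in per_grid_data]
combined = np.concatenate(pieces)

# Count-first alternative: size the destination once, fill in place.
n = sum(int((arr > 0.5).sum()) for arr in per_grid_data)
out = np.empty(n)
i = 0
for arr in per_grid_data:
    selected = arr[arr > 0.5]
    out[i:i + selected.size] = selected
    i += selected.size
```

Counting before allocating is exactly what a selector that can report sizes makes cheap.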
Furthermore, several times recently the question has come up of handling and supporting more generic geometries: spherical, polar, cylindrical. Unfortunately, because we're so heavily based around the idea of grids as is, these are difficult to support. All of our collision detection has been designed with rectilinear surfaces in mind. That being said, I'm really quite happy with how yt *functions* for patch-based grids. I'm just not sure that as it stands it is maintainable, and I think that addressing the geometry issues could ultimately also help out a great deal with scalability. (More on that below.)
= Geometry Overhaul =
For the last couple weeks, I have been working on an overhaul of the geometry system in yt, attempting to handle parallel striding in a format-agnostic way, deal with generating fields in as simple a way as possible, and also attempting to speed things up. I'm going to summarize below the changes that I've made, the new design, and the current status.
== Design: Selectors ==
As it stands, yt data objects (such as Region, Sphere, etc.) are mostly thought of as conduits for data. They apply criteria to the domain and select all of the fluid points or particles within a region, or that fit a specific non-spatial criterion. I have extended this concept to construct very lightweight objects called selectors, similar to how ParticleIO is now conducted for Enzo. These are objects in Cython that accept either a set of bounding-box arrays or a set of spatial points and decide if those points are included in a given region. This is essentially a combination of the get_point_indices and get_cut_mask methods that currently exist. These have all been designed to be GIL-free, so that in the future they can be threaded using OpenMP (although it turns out that not very much time is spent in them). To date I have implemented spheres, regions, disks, cutting planes, slices, ortho rays, and grid collections. They all give bitwise identical results to the old method and are almost all substantially faster. (Cutting planes improve the most, regions are slightly slower, and the others are all about 25-40% faster.)
One primary advantage is that these can (and do) trivially provide counting methods, masking methods, and point-evaluation methods. So we can use the same routines to select particles and individual points (for collision detection) that we use for selecting grid cells. This will allow reuse of data object code with particle outputs, and is a primary advantage of the refactor.
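As a toy illustration (this is not the actual Cython API; the class and method names here are invented), a single selector can serve masking, counting, and point evaluation alike:

```python
import numpy as np

class SphereSelector(object):
    """Toy sphere selector. The real selectors are GIL-free Cython
    classes; this only sketches the shared-interface idea."""

    def __init__(self, center, radius):
        self.center = np.asarray(center, dtype="float64")
        self.radius = float(radius)

    def select_points(self, positions):
        # Point evaluation: identical logic for particle positions
        # and for cell centers, so both reuse one code path.
        d2 = ((positions - self.center) ** 2).sum(axis=1)
        return d2 <= self.radius ** 2

    def count_points(self, positions):
        # Counting lets callers preallocate output arrays exactly.
        return int(self.select_points(positions).sum())

sel = SphereSelector([0.5, 0.5, 0.5], 0.25)
points = np.array([[0.5, 0.5, 0.6],   # inside
                   [0.9, 0.9, 0.9]])  # outside
mask = sel.select_points(points)
```

The same object answering "how many?", "which ones?", and "is this point in?" is what lets one routine serve grid cells, particles, and collision detection.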
This also led to separating selection-based data containers from constructed data containers.
== Design: Ditching the Hierarchy ==
(Hopefully that heading caught some eyes.) In line with attempting to reduce Enzo-isms in the code, I have been moving away from specifically calling something a "hierarchy," and instead moving toward referring to geometry handlers. grids, grid_left_edges, grid_right_edges, and so on will no longer be guaranteed to appear hanging off these objects. The new class structure:
GeometryHandler
GridGeometryHandler(GeometryHandler)
OctreeGeometryHandler(GeometryHandler)
and eventually, others. The concept behind this is to isolate those items that *must* require domain knowledge, mesh knowledge, or particle/fluid distribution knowledge, and place those in the GeometryHandler. All items specific to patch-based or block-based AMR will go in GridGeometryHandler, and so on. This will allow for fundamentally different methods of IO for octrees and grids, for instance. The geometry handlers are all required to implement the following methods:
_read_selection(self, fields, data_object, chunk)
_chunk(self, data_object, chunking_style, ngz = 0)
I will describe chunks later. Note that the first routine, _read_selection, accepts the names of the fields and the data object that describes how to select which portions of a dataset participate in a given selection. For the specific case of Enzo (as I am still in the design phase, I have only been working with Enzo data, although I will move on to FLASH and Boxlib next), _read_selection calls a routine that counts and masks grids, batches operations to files, and then returns only the necessary data for a selection. The data object does no discarding of out-of-bounds or child-masked data.
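In rough outline (the method bodies and docstrings here are my own sketch, not the branch's code), the handler hierarchy described above might look like:

```python
class GeometryHandler(object):
    """Holds only what requires domain, mesh, or particle/fluid
    distribution knowledge."""

    def _read_selection(self, fields, data_object, chunk):
        # Subclasses batch IO and return only the selected data.
        raise NotImplementedError

    def _chunk(self, data_object, chunking_style, ngz=0):
        # Subclasses yield (IO_unit, size) chunks for the object.
        raise NotImplementedError


class GridGeometryHandler(GeometryHandler):
    """Everything specific to patch/block-based AMR: grid edges,
    child masks, grid-batched IO."""


class OctreeGeometryHandler(GeometryHandler):
    """Octree-specific layout and IO -- no regridding into fake
    patches required."""
```

Data objects would talk only to the base interface, so an octree frontend never has to pretend to be a grid frontend.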
== Design: Chunking ==
Currently, yt uses de facto chunking of data in the form of grids. When Derived Quantities want to handle individual grids, one at a time, they "preload" the data from whatever grids the ParallelAnalysisInterface thinks they deserve. These grids are iterated over, and handled individually, then the result is combined at the end. Profiles do something similar. However, both of these are de facto, and not really designed. They rely on calls to semi-private functions on data objects, manually masking data, on and on.
To get around this ugly spot, I have started working on redesigning IO and data handling to function in "chunks." In this method, data objects will request that the geometry handler supply a set of chunks (the function _chunk, described above, conducts this task.) Chunks are of the form (IO_unit, size), where IO_unit is only ever managed or handled by _read_selection, above.
To start the chunks shuffling over the output, the code calls data_source.chunks(fields, chunking_style). Right now only "spatial", "io" and "all" are supported as chunking styles. These correspond to spatially-oriented division, IO-conserving division, and all-at-once (the last not usually relevant). What happens next is kind of fun. The chunks function looks like this:
def chunks(self, fields, chunking_style, **kwargs):
    for chunk in self.hierarchy._chunk(self, chunking_style, **kwargs):
        with self._chunked_read(chunk):
            self.get_data(fields)
            yield self
Note what it does here -- it actually yields *itself*. However, inside the _chunked_read context manager, the attributes corresponding to the size, the current data source, and so on, are set by the geometry handler (still called a hierarchy here). So, for instance, execution might look like this:
for ds in my_obj.chunks(["Density"], "spatial"):
    print ds is my_obj
    print ds["Density"].size
The first print will actually show True, but the second will report the size of (for instance) the grid currently being iterated over. In this way, it becomes much easier to stride over subsets of data. Derived quantities now look like this:
chunks = self._data_source.chunks([], chunking_style="io")
for ds in parallel_objects(chunks, -1):
    rv = self.func(ds, *args, **kwargs)
It chunks things, evaluates and then stores intermediate results.
This is not meant to replace spatial decomposition in parallel jobs, but it *is* designed to enable much easier and *mesh-neutral* division of labor for parallelism and for IO. If we were to call chunk on an octree, it no longer has to make things look like grids; it just makes them look like flattened arrays (unless you chunk over spatial, which I haven't gotten into yet.)
Essentially, by making the method of subsetting and striding over subsetted data more compartmentalized, the code becomes more clear and more maintainable.
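The yield-itself trick is worth isolating. Below is a minimal, self-contained sketch of the pattern (all names invented; the real _chunked_read also swaps sizes and IO state, and the chunks come from the geometry handler rather than a plain list):

```python
from contextlib import contextmanager

class ToyDataObject(object):
    def __init__(self, chunks):
        self._all_chunks = chunks   # e.g. one dict per "grid"
        self._current = None        # per-chunk state, swapped below

    @contextmanager
    def _chunked_read(self, chunk):
        previous, self._current = self._current, chunk
        try:
            yield
        finally:
            self._current = previous   # restore on the way out

    def chunks(self):
        for chunk in self._all_chunks:
            with self._chunked_read(chunk):
                yield self     # the same object, per-chunk state

    def __getitem__(self, field):
        return self._current[field]

obj = ToyDataObject([{"Density": [1.0, 2.0]}, {"Density": [3.0]}])
sizes = [len(ds["Density"]) for ds in obj.chunks()]
# Each ds is obj itself, but indexing sees only the current chunk.
```

The context manager guarantees the object's state is restored even if the consumer bails out mid-iteration, which is what makes yielding the object itself safe.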
= Status =
As it stands, I am able to select data very nicely -- this means slices, profiles, derived quantities all work. However (see below) particles still need some work and handling, and constructed data objects (smoothed covering grids, projections) have yet to be ported.
= Open Questions =
However, the main reason I am writing -- when the code is not yet fully functional! -- is to ask for advice and consideration on a few points.
== Particle Types ==
In the past, we've talked about handling particle types differently depending on the particle. Because this is destined for yt-3.0, I would like to propose that we break compatibility with old scripts in favor of clarity in our particle types. Specifically, rather than doing:
obj["particle_position_x"][obj["particle_type"] == 1]
We do,
obj[("particle_position_x", 1)]
(or obj[("particle_position_x", "PopIICluster")] if you are using a code that supports such an expression.)
And if a field is found to be a particle field but is accessed without a tuple, we simply raise a KeyError. We can emulate the old behavior with:
obj[("particle_position_x",None)]
But I am inclined to say that I would rather *mandate* particle type selection, rather than merely allow it. This will lead to much greater clarity, I think.
[+-][01] on mandating particle types
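The proposed lookup rules can be sketched with a small dict-backed stand-in (the storage layout and class name are invented for illustration):

```python
class ToyFieldAccess(object):
    """Sketch of mandated (field, particle_type) keys."""

    def __init__(self):
        # Hypothetical storage, keyed by (field name, particle type).
        self._particle_fields = {
            ("particle_position_x", 1): [0.1, 0.2],
            ("particle_position_x", 2): [0.7],
        }

    def __getitem__(self, key):
        if isinstance(key, tuple):
            name, ptype = key
            if ptype is None:
                # Emulate the old behavior: every type, concatenated.
                values = []
                for (n, t) in sorted(self._particle_fields):
                    if n == name:
                        values.extend(self._particle_fields[n, t])
                return values
            return self._particle_fields[name, ptype]
        if any(n == key for (n, t) in self._particle_fields):
            # A bare particle-field name: mandate the tuple form.
            raise KeyError("particle fields require (field, type)")
        raise KeyError(key)

obj = ToyFieldAccess()
x_type1 = obj[("particle_position_x", 1)]     # [0.1, 0.2]
x_all = obj[("particle_position_x", None)]    # [0.1, 0.2, 0.7]
```

The None key gives back the old all-particles view, while a bare string fails loudly instead of silently mixing types.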
== Particle Access ==
Currently, we access particles off a data_object:
sphere["particle_position_x"]
Do we want to keep this, or be much more explicit and require sphere.particles["particle_position_x"]? I can see both sides to this. Explicit is good, and would allow us to be clearer with users about fields (and with our IO methods), but allowing the old behavior means users don't necessarily have to worry about whether something is a fluid or a particle, which I think is a good thing.
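For comparison, the explicit route could be a thin namespace object hanging off the data object (a sketch only; every name here is invented):

```python
class ParticleNamespace(object):
    """Explicit gateway: sphere.particles["particle_position_x"]
    routes only to particle IO, never to fluid fields."""

    def __init__(self, data_object):
        self._obj = data_object

    def __getitem__(self, field):
        return self._obj._read_particle_field(field)


class ToySphere(object):
    def __init__(self, particle_data):
        self._particle_data = particle_data
        self.particles = ParticleNamespace(self)

    def _read_particle_field(self, field):
        return self._particle_data[field]

sphere = ToySphere({"particle_position_x": [0.1, 0.5, 0.9]})
xs = sphere.particles["particle_position_x"]
```

A gateway like this also gives the IO layer an unambiguous signal that a request is for particles, without inspecting field names.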
== User Survey ==
Should we conduct a brief survey (only a handful of questions) polling users for their feelings on the changes or potential changes to nomenclature, particles, and so on? We could also use it as a time to identify other outstanding issues with yt.
= Conclusion =
Does this all make a sort of sense?
[+-][01] on the design presented so far
If you're interested in helping out, please get in touch. I'm still laying groundwork in some areas, but in other areas I would appreciate help and advice moving forward.
Thanks to everyone who made it this far,
Matt

_______________________________________________
yt-dev mailing list
yt-dev@lists.spacepope.org
http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
I agree with Casey about the particle UI, and am +1 on changing it. It should be easy to programmatically get the list of all particle types in a simulation, and from there get the list of all particle fields associated with each particle type. The dict-of-dicts organization -- obj.particles["PopIICluster"]["position_x"] -- makes the most sense to me, since you can use the particles.keys() and particles["PopIICluster"].keys() functions.

Nathan Goldbaum
Graduate Student
Astronomy & Astrophysics, UCSC
goldbaum@ucolick.org
http://www.ucolick.org/~goldbaum

On Feb 24, 2012, at 10:46 AM, Casey W. Stark wrote:
Great stuff Matt.
+1 on the presented GeometryHandler design. It doesn't sound like it changes performance much and it's a nicer organization.
I like the chunking work, but I'm confused about syntax and what goes where. Does each data object define its own chunks method? In the chunks method, would it be clearer to return something like a "chunk" data object instead of self? This probably doesn't matter though, since a user would not care about type(ds).
I haven't worked with particle types in yt (it's all DM to me!), but I would prefer the `obj.particles["position_x"]` syntax. With the current design, I don't like that I have to inspect fields to see if they are particle fields or on the grid. This is fine now because I am familiar with the fields in my data, but I remember this being confusing as a new user.
For particle type selection, I think syntax like `obj.particles["PopIICluster"]["position_x"]` or even `obj.particles.type("PopIICluster")["position_x"]` is clearer. With a tuple it almost looks like you are selecting two fields. If we go with the tuple though, can we make the particle type the first element? I have no idea about mandating a particle type, but we can hash that out in the hangout.
Best, Casey
On Fri, Feb 24, 2012 at 8:12 AM, Matthew Turk <matthewturk@gmail.com> wrote: Hi all,
This email is going to be long, but the changes I am proposing in it will require feedback. Next Monday at 1PM Eastern I'll be holding a hangout on Google+, if you would like to join in and discuss it with lower latency. (Of course, if nobody shows up, that's okay too.) We don't really have a detailed review process for design decisions like this, and I hope that this discussion will guide developing that kind of review process in the future.
Everything described here is in the branch geometry_handling at https://bitbucket.org/MatthewTurk/yt-refactor
= Background =
Geometry in yt has always been designed around two things:
1) Grid patches 2) Spatial location based on bounding boxes
This has presented a number of excellent opportunities for optimizing patch-based AMR data. However, it has posed challenges for supporting particle-based datasets. (Particles inside patch-based data are somewhat easier, although if you want to see one of the worst pieces of yt, look no further than FakeGridForParticles, which mocks up particles as grids with varying x,y,z, so that existing fluid-selection routines can be used.) Furthermore, the state of Octree codes is not good. For relatively small (up to about 64) domains in RAMSES, we can do analysis with relative ease -- volume rendering, projections, fluid analysis, etc, all work very nicely. However, what is required is a costly step of regridding octs at the outset, to form grid patches with some efficiency. This regridding is not stored between runs, and it's not *efficient*.
This has worked so far, but it is showing its strain. As it stands, yt cannot directly support SPH codes without somehow making SPH look like grids. Octree codes can't be supported without making them look like grid codes. Both of these are very slow operations. And on top of that, the current state of how data is read in, even for grid codes, is a bit gross. There are a number of conditionals, mechanisms for striding over data (i.e., "_get_grids()" which is called in parallel, and iterates over grids, and sometimes you have to call _get_point_indices and sometimes you don't), and probably worst of all a lot of unnecessary memory duplication because arrays in regions/spheres/etc are constructed by iterating over grids, filling in a list of arrays, and then concatenating that list into a second array.
Furthermore, several times recently the question has come up of handling and supporting more generic geometries: spherical, polar, cylindrical. Unfortunately, because we're so heavily based around the idea of grids as is, these are difficult to support. All of our collision detection has been designed with rectilinear surfaces in mind. That being said, I'm really quite happy with how yt *functions* for patch-based grids. I'm just not sure that as it stands it is maintainable, and I think that addressing the geometry issues could ultimately also help out a great deal with scalability. (More on that below.)
= Geometry Overhaul =
For the last couple weeks, I have been working on an overhaul of the geometry system in yt, attempting to handle parallel striding in a format-agnostic way, deal with generating fields in as simple a way as possible, and also attempting to speed things up. I'm going to summarize below the changes that I've made, the new design, and the current status.
== Design: Selectors ==
As it stands, yt data objects (such as Region, Sphere, etc) are mostly thought of as conduits for data. They apply a criteria to the domain and select all of the fluid points or particles within a region or that fit a specific non-spatial criteria. I have extended this concept to construct very lightweight objects called selectors, similar to how ParticleIO is now conducted for Enzo. These are objects in Cython that accept either a set of bounding-box arrays or a set of spatial points and decide if those points are included in a given region. This is essentially a combination of the get_point_indices and get_cut_mask methods that currently exist. These have all been designed to be GIL-free, so that in the future they can be threaded using OpenMP (although it turns out not very much time is spent on them.) To date I have implemented spheres, regions, disks, cutting planes, slices, ortho rays, and grid collections. They all give bitwise identical results to the old method and are almost all substantially faster. (Cutting planes are particularly better, regions are slightly slower, and the others are all about 25-40% faster.)
One primary advantage is that these can (and do) trivially provide counting methods, masking methods, and point-evaluation methods. So we can use the same routines to select particles and individual points (for collision detection) that we use for selecting grid cells. This will allow reuse of data object code with particle outputs, and is a primary advantage of the refactor.
This also led to separating selection-based data containers from constructed data containers.
== Design: Ditching the Hierarchy ==
(Hopefully that heading caught some eyes.) In line with attempting to reduce Enzo-isms in the code, I have been moving away from specifically calling something a "hierarchy," and instead moving toward referring to geometry handlers. grids, grid_left_edges, grid_right_edges, and so on will no longer be guaranteed to appear hanging off these objects. The new class structure:
GeometryHandler
GridGeometryHandler(GeometryHandler)
OctreeGeometryHandler(GeometryHandler)
and eventually, other handlers. The concept behind this is to isolate those items that *must* require domain knowledge, mesh knowledge, or particle/fluid distribution knowledge, and place those in GeometryHandler. All items specific to patch-based or block-based AMR will go in GridGeometryHandler, and so on. This will allow fundamentally different methods of IO for octrees than for grids, for instance. The geometry handlers are all required to implement the following methods:
_read_selection(self, fields, data_object, chunk)
_chunk(self, data_object, chunking_style, ngz = 0)
I will describe chunks later. Note that the first routine, _read_selection, accepts the names of the fields and the data object that describes how to select which portions of a dataset participate in a given selection. For the specific case of Enzo (as I am still in the design phase, I have only been working with Enzo data, although I will move on to FLASH and Boxlib next), _read_selection calls a routine that counts and masks grids, batches operations to files, and then returns only the necessary data for a selection. The data object does no discarding of out-of-bounds or child-masked data.
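A rough sketch of what this class split might look like. Only the class names and the two required method signatures come from the proposal above; the bodies, the grid representation, and the chunking logic here are invented mock-ups for illustration.

```python
# Illustrative sketch of the GeometryHandler hierarchy. The _chunk
# implementation below is a mock-up (grids are plain dicts), not the
# actual yt design; only names follow the email.

class GeometryHandler:
    def _read_selection(self, fields, data_object, chunk):
        raise NotImplementedError

    def _chunk(self, data_object, chunking_style, ngz=0):
        raise NotImplementedError

class GridGeometryHandler(GeometryHandler):
    """Knows about patch/block AMR: grids, edges, child masks."""
    def __init__(self, grids):
        self.grids = grids  # hypothetical: list of per-grid records

    def _chunk(self, data_object, chunking_style, ngz=0):
        # Chunks are (IO_unit, size) pairs, per the design below.
        if chunking_style == "spatial":
            for grid in self.grids:       # one chunk per grid patch
                yield (grid, grid["size"])
        elif chunking_style == "all":
            yield (self.grids, sum(g["size"] for g in self.grids))

class OctreeGeometryHandler(GeometryHandler):
    """Would chunk octs as flattened arrays rather than mock grids."""
```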
== Design: Chunking ==
Currently, yt uses de facto chunking of data in the form of grids. When Derived Quantities want to handle individual grids, one at a time, they "preload" the data from whatever grids the ParallelAnalysisInterface thinks they deserve. These grids are iterated over, and handled individually, then the result is combined at the end. Profiles do something similar. However, both of these are de facto, and not really designed. They rely on calls to semi-private functions on data objects, manually masking data, on and on.
To get around this ugly spot, I have started working on redesigning IO and data handling to function in "chunks." In this method, data objects will request that the geometry handler supply a set of chunks (the function _chunk, described above, conducts this task.) Chunks are of the form (IO_unit, size), where IO_unit is only ever managed or handled by _read_selection, above.
To start the chunks striding over the output, the code calls data_source.chunks(fields, chunking_style). Right now only "spatial", "io" and "all" are supported as chunking styles. These correspond to spatially-oriented division, IO-conserving division, and all-at-once (not usually relevant). What happens next is kind of fun. The chunks function looks like this:
def chunks(self, fields, chunking_style, **kwargs):
    for chunk in self.hierarchy._chunk(self, chunking_style, **kwargs):
        with self._chunked_read(chunk):
            self.get_data(fields)
            yield self
Note what it does here -- it actually yields *itself*. However, inside the chunked_read function, what happens is that the attributes corresponding to the size, the current data source, and so on, are set by the geometry handler (still called a hierarchy here.) So, for instance, execution might look like this:
for ds in my_obj.chunks(["Density"], "spatial"):
    print ds is my_obj
    print ds["Density"].size
The first line will actually print True, but the results from the second one will be the size of (for instance) the grid it's currently iterating over. In this way, it becomes much easier to stride over subsets of data. Derived quantities now look like this:
chunks = self._data_source.chunks([], chunking_style="io")
for ds in parallel_objects(chunks, -1):
    rv = self.func(ds, *args, **kwargs)
It chunks things, evaluates and then stores intermediate results.
This is not meant to replace spatial decomposition in parallel jobs, but it *is* designed to enable much easier and *mesh-neutral* division of labor for parallelism and for IO. If we were to call chunk on an octree, it no longer has to make things look like grids; it just makes them look like flattened arrays (unless you chunk over spatial, which I haven't gotten into yet.)
Essentially, by making the method of subsetting and striding over subsetted data more compartmentalized, the code becomes more clear and more maintainable.
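The pieces above -- chunks as (IO_unit, size) pairs, a _chunked_read context manager, and a generator that yields the object itself -- can be combined into a self-contained mock-up. Every class here is invented for illustration and greatly simplified from the design described; only the chunks()/_chunk flow follows the email.

```python
# Mock-up of the chunking flow, assuming invented classes (MockHandler,
# MockDataObject) standing in for a geometry handler and a data object.
from contextlib import contextmanager
import numpy as np

class MockHandler:
    def __init__(self, grid_data):
        self.grid_data = grid_data  # list of per-grid field dicts

    def _chunk(self, data_object, chunking_style, **kwargs):
        for grid in self.grid_data:              # one IO unit per grid
            yield (grid, len(grid["Density"]))

class MockDataObject:
    def __init__(self, handler):
        self.hierarchy = handler
        self._current = None

    @contextmanager
    def _chunked_read(self, chunk):
        self._current = chunk[0]                 # swap in the chunk's data
        try:
            yield
        finally:
            self._current = None

    def get_data(self, fields):
        pass                                     # real IO would happen here

    def __getitem__(self, field):
        return self._current[field]

    def chunks(self, fields, chunking_style, **kwargs):
        for chunk in self.hierarchy._chunk(self, chunking_style, **kwargs):
            with self._chunked_read(chunk):
                self.get_data(fields)
                yield self

obj = MockDataObject(MockHandler([{"Density": np.ones(4)},
                                  {"Density": np.ones(16)}]))
sizes = [ds["Density"].size for ds in obj.chunks(["Density"], "spatial")]
# sizes == [4, 16], and each ds is obj itself
```

The point of the mock-up is that ds["Density"] gives a different answer on each pass through the loop, even though ds is the same object every time.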
= Status =
As it stands, I am able to select data very nicely -- slices, profiles, and derived quantities all work. However (see below), particles still need some work, and constructed data objects (smoothed covering grids, projections) have yet to be ported.
= Open Questions =
However, the main reason I am writing -- when the code is not yet fully functional! -- is to ask for advice and consideration on a few points.
== Particle Types ==
In the past, we've talked about handling particle types differently depending on the particle. Because this is destined for yt-3.0, I would like to propose that we break compatibility with old scripts in favor of clarity in our particle types. Specifically, rather than doing:
obj["particle_position_x"][obj["particle_type"] == 1]
We do,
obj[("particle_position_x", 1)]
(or obj[("particle_position_x", "PopIICluster")] if you are using a code that supports such an expression.)
And if a field is found to be a particle-type field but is accessed without a tuple, we simply raise a KeyError. We can emulate old behavior with:
obj[("particle_position_x", None)]
But I am inclined to say that I would rather *mandate* particle type selection, rather than merely allow it. This will lead to much greater clarity, I think.
[+-][01] on mandating particle type selection
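The mandated-tuple access proposed above can be sketched in a few lines. The container class, its field list, and its storage layout are all hypothetical; only the key shapes ((field, ptype), bare name raising KeyError, None emulating old behavior) come from the proposal.

```python
# Hypothetical FieldContainer illustrating mandated particle-type keys.
# Storage as a flat dict keyed by (field, ptype) is an assumption.

class FieldContainer:
    PARTICLE_FIELDS = {"particle_position_x", "particle_position_y"}

    def __init__(self, data):
        self._data = data  # {(field, ptype): values}

    def __getitem__(self, key):
        if isinstance(key, tuple):
            field, ptype = key
            if ptype is None:
                # Emulate old behavior: gather all particle types.
                out = []
                for (f, p), v in sorted(self._data.items()):
                    if f == field:
                        out.extend(v)
                return out
            return self._data[(field, ptype)]
        if key in self.PARTICLE_FIELDS:
            raise KeyError(
                "%s is a particle field; use (field, particle_type)" % key)
        return self._data[key]
```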
== Particle Access ==
Currently, we access particles off a data_object:
sphere["particle_position_x"]
Do we want to keep this, or move to being much more explicit and require sphere.particles["particle_position_x"]? I can see both sides. Explicit is good, and would allow us to be clearer with users about fields (and with our IO methods), but allowing the old behavior means users don't have to worry about whether something is a fluid or a particle, which I think is a good thing.
== User Survey ==
Should we conduct a brief survey (only a handful of questions) polling users for their feelings on the changes or potential changes to nomenclature, particles, and so on? We could also use it as a time to identify other outstanding issues with yt.
= Conclusion =
Does this all make a sort of sense?
[+-][01] on the design presented so far
If you're interested in helping out, please get in touch. I'm still laying groundwork in some areas, but in other areas I would appreciate help and advice moving forward.
Thanks to everyone who made it this far,
Matt
_______________________________________________
yt-dev mailing list
yt-dev@lists.spacepope.org
http://lists.spacepope.org/listinfo.cgi/yt-dev-spacepope.org
Hi Matt,

I think the geometry refactoring will have a huge impact on the (octree-based) ART frontend, so I give it an emphatic +1! I've been trying to wrap my head around our hacked-together grid-cum-octree process, and it's been difficult to reconcile both ideas in a natural way. Having a geometry handler, with two different derived classes for the two architectures, is, I think, much clearer.

I see how this changes the organization of the frontend, but I'm fuzzy on how this affects data_objects downstream -- it sounds like those need rewriting? You also mentioned that yt doesn't support SPH codes without casting particles as grids in some way -- is this improved by the refactoring?

chris

On Fri, Feb 24, 2012 at 11:27 AM, Nathan Goldbaum <goldbaum@ucolick.org> wrote:
I agree with Casey about the particle UI, and am +1 on changing it.
It should be easy to programmatically get the list of all particle types in a simulation, and from there the list of all particle fields associated with each type. The dict-of-dicts organization (obj.particles["PopIICluster"]["position_x"]) makes the most sense to me, since you can use the particles.keys() and particles["PopIICluster"].keys() functions.
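A tiny sketch of this dict-of-dicts idea, assuming an invented Particles wrapper (this is not an existing yt interface, and the type and field names are just examples):

```python
# Hypothetical dict-of-dicts particle container: select the particle
# type first, then the field; keys() enumerates types and fields.

class Particles:
    def __init__(self, by_type):
        self._by_type = by_type  # {ptype: {field: values}}

    def keys(self):
        return list(self._by_type)

    def __getitem__(self, ptype):
        return self._by_type[ptype]
```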
Nathan Goldbaum
Graduate Student
Astronomy & Astrophysics, UCSC
goldbaum@ucolick.org
http://www.ucolick.org/~goldbaum
On Feb 24, 2012, at 10:46 AM, Casey W. Stark wrote:
Great stuff Matt.
+1 on the presented GeometryHandler design. It doesn't sound like it changes performance much and it's a nicer organization.
I like the chunking work, but I'm confused about syntax and what goes where. Does each data object define its own chunks method? In the chunks method, would it be clearer to return something like a "chunk" data object instead of self? This probably doesn't matter though, since a user would not care about type(ds).
I haven't worked with particle types in yt (it's all DM to me!), but I would prefer the `obj.particles["position_x"]` syntax. With the current design, I don't like that I have to inspect fields to see if they are particle fields or on the grid. This is fine now because I am familiar with the fields in my data, but I remember this being confusing as a new user.
For particle type selection, I think syntax like `obj.particles["PopIICluster"]["position_x"]` or even `obj.particles.type("PopIICluster")["position_x"]` is clearer. With a tuple it almost looks like you are selecting two fields. If we go with the tuple though, can we make the particle type the first element? I have no idea about mandating a particle type, but we can hash that out in the hangout.
Best, Casey
On Fri, Feb 24, 2012 at 8:12 AM, Matthew Turk <matthewturk@gmail.com> wrote:
Hi all,

One thought comes to mind with regards to selecting particle types in the new method: there needs to be a simple and memory-efficient way to select multiple types of particles (or all types). For example, for the halo finders, sometimes only dark matter is wanted, and other times stars (or perhaps even only certain types of stars) are wanted as well. In the past we've done things to avoid na.concatenate with particles because it fragments memory. I think Matt has thought of this already, but I'd like to extend this discussion to how to allow users to select particle types without having to call na.concatenate manually (or have it happen behind the scenes).

With regards to obj['particle_position_x'] or obj.particles['position_x'], I think the argument for changing it is not a bad one, but if we do this, I think it would make the most sense to extend it to other things, like obj.grid['Density'], for continuity's sake.

I'm for a user survey if we think we'll get replies. Have users generally been forthcoming with opinions on technical stuff like this before?

I'll see you in the chat on Monday!

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
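Stephen's memory concern can be sketched as follows: count first, allocate once, then fill, instead of concatenating per-type arrays. The function name and data layout are invented for illustration (and numpy is spelled np here rather than the old na alias).

```python
# Hypothetical multi-type particle selection without concatenation:
# a single allocation sized by counting, filled type by type.
import numpy as np

def select_types(positions_by_type, wanted):
    """Gather positions for the wanted particle types into one array."""
    total = sum(len(positions_by_type[t]) for t in wanted)
    out = np.empty(total, dtype="float64")   # single allocation up front
    offset = 0
    for t in wanted:
        arr = positions_by_type[t]
        out[offset:offset + len(arr)] = arr  # fill in place, no temporaries
        offset += len(arr)
    return out
```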
Hi everybody,

Just a quick reminder -- at 1PM today I'll be holding a G+ hangout about geometry refactoring. I've prepared a bit of a presentation, and I'll also go over some code, but I mainly want to get some feedback and talk about ideas! This will be a great time to brainstorm as well as provide feedback on what is already in place. The hangout will be on my profile, which is here: https://plus.google.com/106495956319144885505/ That's in just about three hours. No further reminders will be forthcoming.

I also wanted to thank everyone for replying, and I'll respond briefly to comments here. One thing I really want to convey, though, is that this refactor is mostly about stopping thinking of everything in terms of *grids*. This is the myopia that would prevent yt from really moving forward; to that end, I do not want *any* data objects in yt to directly access grid objects unless they are specifically tasked with handling grid geometry or handling a

Casey -- returning a chunk object is probably cleaner, but I didn't want to add on a bunch of new overloads and potentially duplicated logic. I don't see reading in a chunked fashion as something that is commonly (or ever) exposed to analysis scripts, and if a script author does choose to use chunking, they should know what they are doing. Each data *object* (i.e., sphere, region, etc) does *not* define its own chunking routine, but rather calls the chunking routine from the geometry handler, which is something like a grid-based, particle-based, or oct-based handler. These can be overridden by specific codes (for instance, if IO chunking proceeds differently in Nyx versus Enzo, even though they are both grid-based codes).

Nathan -- I like the syntax you and Casey have come up with. We can also add [:] or "all" for particle typing.

Chris -- yup, data_objects needs rewriting. That's what I have been doing, and I got first light on plotting slices, cutting planes, and profiles this weekend.
The harder part will be data objects that require some degree of construction, like profiles and covering grids. I hope to start work on those soon. As for whether this would help with SPH, the answer is emphatically yes! While calculating fluid quantities from particles that require neighbor searches will still be challenging, the process of loading and looking at SPH data in the raw should be greatly improved. As an example of what we will be able to do with octs, we should be able to load ART data, store the index in memory, chunk by whatever is best for IO (sorted by resolution, perhaps), and add to the exported, resorted octree without any regridding. I'd like to work on this with you very soon.

Stephen -- I am not yet sure about the implementation of particle IO, but I would see selection of all particles in a given region being available via something like obj.particles[:]["whatever_field"]. I think we can mostly avoid concatenation in this way. As for accessing things via .grid[field], I am -1 on this, since I don't really want to be thinking about grids. We could consider .fluid, but in general I think I am okay with continuing to keep fluids first-class and particles separate from them. I am willing to be convinced otherwise.

Thanks everybody,
Matt

On Sat, Feb 25, 2012 at 7:26 PM, Stephen Skory <s@skory.us> wrote:
Hi Matt,
> As for accessing things via .grid[field], I am -1 on this since I don't
> really want to be thinking about grids. We could consider .fluid, but in
> general I think I am okay with continuing to keep fluids first class and
> particles separate from them. I am willing to be convinced otherwise.
I realized as soon as I sent it that I didn't mean what I had written, but it wasn't important enough to send another email about. Yes, I agree that .grid is not what we want. I was thinking "we have .particles, so we should also use .grid for grid quantities like density." I think there are several reasons why we shouldn't make accessing particles different from fluid quantities, but we can chat about that in "person".

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
participants (5)
- Casey W. Stark
- Christopher Moody
- Matthew Turk
- Nathan Goldbaum
- Stephen Skory