I've shared some of this with ChrisM and SamL, but I wanted to bring it up here. This largely pertains to the codes RAMSES and NMSU-ART that are currently supported, as well as the support for particle Octrees. ARTIO is somewhat orthogonal at this time, as it doesn't store the octree in memory.
Currently, Octs have:
* ind (64 bits)
* local_ind (64 bits)
* domain (64 bits)
* pos (3 values, 64 bits each)
* level (64 bits)
* pointer-to-sd (64 bits)
* children, 8 pointers (64 bits each)
* parent (64 bits)
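For concreteness, here is a minimal C sketch of that layout (field names follow the list above; the actual Cython source may differ in types and naming):

{{{
#include <stdint.h>

/* Current Oct layout: every oct carries its own identity and
   connectivity, so it can be traversed in either order on its own. */
typedef struct Oct Oct;
struct Oct {
    int64_t ind;          /* index in allocation order          */
    int64_t local_ind;    /* index in depth-first order         */
    int64_t domain;       /* owning domain                      */
    int64_t pos[3];       /* integer position on its level      */
    int64_t level;        /* refinement level                   */
    void   *sd;           /* pointer-to-sd (code-specific data) */
    Oct    *children[8];  /* one pointer per child octant       */
    Oct    *parent;       /* back-pointer for upward traversal  */
};
/* 17 fields x 8 bytes = 136 bytes per oct on a 64-bit machine. */
}}}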
This comes out to 136 bytes. Much of this overhead exists because we support two different methods of traversing the Octs -- allocation order and depth-first order -- and because when I wrote this (seems like yesterday) I was trying to get something done without worrying too much about memory optimization. Both of these things are becoming much more important: we need to think about the memory implications of a large number of leaf nodes, and we need to verify that the order of traversal either does not matter or is guaranteed to be identical.
Last week I spent some time rewriting the traversal code to attempt to get rid of much of the self-awareness of the octs, such that *outside* of a traversal, they are somewhat meaningless. This lets me cut it down to:
* domain index (which we may be able to eliminate in the future)
* code-specific index
* allocatable pointer to the 8 children (stored linearly)
* pointer-to-sd object
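In the same hypothetical C notation as above, the slimmed-down oct would look something like:

{{{
#include <stdint.h>

/* Proposed slimmed-down Oct: identity and traversal state move out
   of the struct and into a traversal object (sketched below).      */
typedef struct Oct Oct;
struct Oct {
    int64_t domain;    /* domain index; may be removable later       */
    int64_t file_ind;  /* code-specific index (name is illustrative) */
    Oct    *children;  /* NULL for a leaf; else a block of 8 Octs    */
    void   *sd;        /* pointer-to-sd object                       */
};
/* Leaf: 4 x 8 bytes = 32 bytes.  A refined oct additionally owns a
   single contiguous allocation of 8 child Octs.                     */
}}}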
This should be 32 bytes for leaf nodes; it expands for non-leaf nodes, although we may be able to cut that down as well (at the expense of slight over-allocation when not all cells need to be refined simultaneously). I've implemented this with a traversal object that tracks the level state, the oct state, and so on. This works for all of the tests I have been able to create for it, but it results in a slight performance decrease. (I believe we can optimize this in the future.)
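As a rough sketch of what I mean by a traversal object, building on the hypothetical structs above (names and details are illustrative, not the actual yt API):

{{{
#include <stdint.h>

/* The traversal object carries the state that used to live on each
   oct: level, position, and a running depth-first index.           */
typedef struct {
    int64_t level;    /* current refinement level       */
    int64_t pos[3];   /* integer position at `level`    */
    int64_t index;    /* running depth-first leaf index */
    void   *data;     /* user payload for callbacks     */
} OctVisitor;

/* Depth-first walk; the oct itself stays dumb. */
static void visit(Oct *o, OctVisitor *v,
                  void (*func)(Oct *, OctVisitor *)) {
    if (o->children == NULL) {  /* leaf: apply the callback */
        func(o, v);
        v->index++;
        return;
    }
    v->level++;
    for (int i = 0; i < 8; i++) {
        int64_t opos[3] = { v->pos[0], v->pos[1], v->pos[2] };
        v->pos[0] = 2 * opos[0] + (i & 1);         /* octant offsets */
        v->pos[1] = 2 * opos[1] + ((i >> 1) & 1);
        v->pos[2] = 2 * opos[2] + ((i >> 2) & 1);
        visit(&o->children[i], v, func);
        v->pos[0] = opos[0]; v->pos[1] = opos[1]; v->pos[2] = opos[2];
    }
    v->level--;
}
}}}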
Anyway, here's where I'm going with all of this. I'd like to extend the Octree base class, but in a way that leaves several items up to individual implementations. Furthermore, I'd like to start a brief discussion of how we might plan for this becoming distributed memory, which was impossible before but which I think will be possible by removing state from the Octs.
= New Octree Base Class Methods =
I'd like to propose these methods become part of the base octree class:
* next_root(int domain_id, int ind)
* next_child(int domain_id)
RAMSESOctreeContainer implements the first of these, and a version of the second (which still contains routines that are no longer necessary). The distinction is that the first routine will be able -- in the future, with no changes to the API -- to add the root-level oct to a sparse list of octs that belong to a given domain, which I hope will enable sparse octrees. Currently it adds them to a root_mesh attribute that lives on all the octrees.
By using this, we'll be able to move more of the addition of new octs into the base Octree class, which should hide all of the logic for adding new octs from the individual implementations.
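A hedged sketch of what those hooks might look like in the base class (again in C-style pseudocode building on the structs above; the real methods would live on the Cython OctreeContainer, and next_child here takes an explicit parent/octant pair where the proposed next_child(int domain_id) would presumably lean on internal traversal state):

{{{
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    Oct   **root_mesh;  /* dense root grid for now; sparse later */
    int64_t nroot;      /* number of allocated root-level octs   */
} OctreeContainer;

/* Return the root-level oct at flat index `ind`, allocating it and
   tagging it with `domain_id` if it does not exist yet.            */
Oct *next_root(OctreeContainer *tree, int domain_id, int ind) {
    Oct *o = tree->root_mesh[ind];
    if (o == NULL) {
        o = calloc(1, sizeof(Oct));
        o->domain = domain_id;
        tree->root_mesh[ind] = o;
        tree->nroot++;
    }
    return o;
}

/* Descend into octant `i` of `parent`, allocating the contiguous
   block of 8 children on demand.                                   */
Oct *next_child(OctreeContainer *tree, Oct *parent,
                int domain_id, int i) {
    (void)tree;
    if (parent->children == NULL)
        parent->children = calloc(8, sizeof(Oct));
    parent->children[i].domain = domain_id;
    return &parent->children[i];
}
}}}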
[+-] on moving these explicitly to the base class and mandating their presence?
Because this leaves us with a single method of traversing the octree, it also means that icoords, fcoords and ires can all be moved to the base class, which is the ultimate reason for the changes to the Octree anyway.
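For instance, icoords could then be written once as a callback over the traversal object sketched above (illustrative only; per-cell offsets within each oct are omitted for brevity):

{{{
/* Fill an (nleaf, 3) integer coordinate array from traversal state;
   v->data is assumed to point at the output buffer.                */
static void icoords_visitor(Oct *o, OctVisitor *v) {
    int64_t *out = (int64_t *)v->data;
    for (int d = 0; d < 3; d++)
        out[3 * v->index + d] = v->pos[d];
    (void)o;  /* per-oct data is not needed for coordinates */
}
}}}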
= Sparse Octrees =
It is my hope that we can stop allocating the "mask" variable that gets passed around and is (noct, 8) in shape, and instead (if a mask is needed at all) only look at domain-by-domain information. For RAMSES and Gadget this will be a big improvement, but much less so for NMSU-ART and Tipsy, which are not currently set up for efficient multi-domain iteration. This will also make spatial chunking more efficient, and I think we'll be able to push mask selection deeper into the octree object itself.
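In terms of the sketches above, domain-by-domain selection might look like this (level/pos bookkeeping elided; hypothetical):

{{{
/* Visit only the leaf octs owned by `domain_id`: no (noct, 8) mask
   needs to be allocated, since ownership is checked per oct.       */
static void visit_domain(Oct *o, OctVisitor *v, int64_t domain_id,
                         void (*func)(Oct *, OctVisitor *)) {
    if (o->children == NULL) {
        if (o->domain == domain_id) {
            func(o, v);
            v->index++;
        }
        return;
    }
    for (int i = 0; i < 8; i++)
        visit_domain(&o->children[i], v, domain_id, func);
}
}}}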
However, before we can start looking at distributed octrees with true sparse oct filling across nodes, we'll need to "pin" the parallelism. This would mean disabling the ability to add new communicators, or possibly splitting workgroups up with an IO handler that communicates across MPI to the nodes that contain the sparse octree. If we instantiate a sparse octree on N processors, we can't then move to subgroups, for instance, so we'll need a way to pin it. But I think that with the changes here -- removing global state and using a consistent mechanism for traversing the octree -- we'll eventually be able to do so.
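To make "pinning" concrete, one hypothetical approach is to duplicate the communicator the octree was built on and hold it for the octree's lifetime, so that later workgroup splits cannot invalidate it (a sketch under those assumptions, not a proposal for the actual API):

{{{
#include <mpi.h>

/* Pin a sparse octree to the communicator it was instantiated on. */
typedef struct {
    MPI_Comm comm;  /* private, duplicated communicator */
    int rank, size;
} PinnedOctree;

int pin_octree(PinnedOctree *t, MPI_Comm comm) {
    /* MPI_Comm_dup gives us our own copy, isolated from any later
       regrouping of the parent communicator.                       */
    MPI_Comm_dup(comm, &t->comm);
    MPI_Comm_rank(t->comm, &t->rank);
    MPI_Comm_size(t->comm, &t->size);
    return 0;
}
}}}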