
Hi Nathaniel,
> - For general operations on n-dimensional sparse arrays, COO format seems like the obvious thing to use. The strategy I've envisioned is storing the data as N arrays of indices + 1 array of values (or possibly you'd want to pack these together into a single array of structs, each holding N+1 values), plus a bit of metadata indicating the current sort order -- so sort_order=(0, 1) would mean that the items are sorted lexicographically by row and then by column (so like CSR), and sort_order=(1, 0) would mean the opposite. Then operations like sum(axis=1) would be "sort so that axis 1 is the last key (if it isn't already); then sum over each contiguous chunk of the data"; sum(axis=(0, 2)) would be "sort to make sure axes 0 and 2 are at the back (but I don't care in which order)", etc. It may be useful to use a stable sort, so that if you have sort_order=(0, 1, 2) and then sort by axis 2, you get sort_order=(2, 0, 1). For binary operations, you ensure that the two arrays have the same sort order (possibly re-sorting the smaller one), and then do a parallel iteration over both of them, finding matching indices.
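Just to check I'm reading this right, here's a quick numpy sketch of that reduction step. The notation is mine, not from your description: idx is a hypothetical (ndim, nnz) index array, vals holds the matching values, and the sort_order bookkeeping is omitted, so it always re-sorts:

    import numpy as np

    def coo_sum(idx, vals, axis):
        # keep every axis except the one being summed over
        ndim = idx.shape[0]
        keep = [ax for ax in range(ndim) if ax != axis]
        # np.lexsort treats its *last* key as primary, so feed the keys
        # reversed: kept axes become leading keys, the summed axis last
        order = np.lexsort(tuple(idx[ax] for ax in reversed(keep + [axis])))
        kept = idx[keep][:, order]
        svals = vals[order]
        # runs of equal kept-indices are now contiguous
        change = np.any(kept[:, 1:] != kept[:, :-1], axis=0)
        starts = np.r_[0, np.nonzero(change)[0] + 1]
        return kept[:, starts], np.add.reduceat(svals, starts)

If the kept axes are already the leading sort keys, the lexsort can be skipped entirely, which I take to be the point of the sort_order metadata.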
> - Instead of proliferating data types for all kinds of different storage formats, I'd really prefer to see a single data type that "just works". So for example, instead of having one storage format that's optimized for building matrices and one that's optimized for operations on matrices, you could have a single object which has:
>
>   - a COO format representation
>   - a DOK of "pending updates"
>
> Writes would go into the DOK part, and at the beginning of each non-trivial operation you'd call self._get_coo_arrays() or whatever, which would do:
>
>     def _get_coo_arrays(self):
>         if self._pending_updates:
>             # do a single batch reallocation and incorporate
>             # all the pending updates
>             ...
>         return self._real_coo_arrays
>
> There are a lot of potentially viable variants on this using different data structures for the various pieces, different detailed reallocation strategies, etc., but you get the idea.
Thanks for sharing your perspective! I really like the idea of the DOK buffer of pending updates. I'll note that the COO structure you describe looks similar to CJ's SpArray, which he mentioned in a parallel thread, and my MapArray can serve as a buffer. Tacking them together seems easy, so one can fairly easily test-drive what you describe (a toy sketch is below).

Looking a bit further though, I doubt that a one-size-fits-all data format is realistic. For one, there's a balance to strike between an optimized 2D structure (e.g., PySparse's ll_mat) and a general n-D structure, and there will always be at least one application which requires one or the other. (My crude experiments so far show about a factor of 2--5 difference in CPU time and/or memory footprint, depending on the tradeoffs taken.) Ultimately then, I think it makes sense to consider three separate layers: a Python frontend, a low-level data structure, and a middle layer which transforms, e.g., the logic required by Python's __setitem__ into iterators over the low-level data structure. For shaping up this middle layer, an std::map is fine, especially since it's already implemented :-).
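To make the buffering concrete, here is the toy sketch mentioned above. It is deliberately naive: a plain dict stands in for MapArray/the DOK part, HybridCOO is a made-up name, and folding duplicates (overwriting an entry that already sits in the COO part) is left out:

    import numpy as np

    class HybridCOO:
        """COO arrays plus a dict-of-keys buffer of pending writes."""

        def __init__(self, shape):
            self.shape = shape
            self._idx = np.empty((len(shape), 0), dtype=np.intp)
            self._vals = np.empty(0)
            self._pending = {}                # {(i0, i1, ...): value}

        def __setitem__(self, index, value):
            self._pending[index] = value      # O(1)-ish, no reallocation

        def _get_coo_arrays(self):
            if self._pending:
                # one batch reallocation incorporating all pending updates
                new_idx = np.array(list(self._pending), dtype=np.intp).T
                new_vals = np.array(list(self._pending.values()))
                self._idx = np.concatenate([self._idx, new_idx], axis=1)
                self._vals = np.concatenate([self._vals, new_vals])
                self._pending.clear()
            return self._idx, self._vals

A real version would also fold duplicates and carry the sort_order metadata, but even this much is enough to test-drive the shape of the API.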
> For interface with external C/C++/Fortran code that wants CSR or CSC, I'd just have get_csr() and get_csc() methods that return a tuple of (indptr, indices, data), and not even try to implement any actual operations on this representation.
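That seems easy enough to provide: starting from COO arrays like the ones above, the CSR tuple is a handful of numpy calls. A rough 2-D sketch (get_csc would be the same with the two axes swapped):

    import numpy as np

    def get_csr(idx, vals, n_rows):
        """Return (indptr, indices, data) from 2-D COO arrays."""
        order = np.lexsort((idx[1], idx[0]))   # by row, then by column
        rows, indices, data = idx[0][order], idx[1][order], vals[order]
        # indptr[i]:indptr[i+1] brackets the stored entries of row i
        indptr = np.zeros(n_rows + 1, dtype=np.intp)
        np.cumsum(np.bincount(rows, minlength=n_rows), out=indptr[1:])
        return indptr, indices, data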
>> * Data types and casting rules. For now, I basically piggy-back on numpy's rules. There are several slightly different ones around (numba has one?), and there might be an opportunity to simplify the rules. OTOH, inventing one more subtly different set of rules might be a bad idea.
> It would be really nice if you could just use numpy's own dtype objects and ufuncs instead of implementing your own. I don't know how viable that is right now, but I think it's worth trying, and if you can't then I'd like to know what part broke :-). Happy to help if you need some tips / further explanation.
For ufuncs, it's simple: it does not work ATM :-(. I tried simple combinations of __array__, __array_prepare__ and __array_finalize__, and they fail for one reason or another. Some details are available in https://github.com/ev-br/sparr/issues/35. In a nutshell, these can probably be made to work with some rather simple modifications to __array_prepare__/__array_wrap__, but I guess that's not very likely to happen, given that __numpy_ufunc__ is supposed to come to town -- which I'm looking forward to :-).

For dtypes, I currently just dispatch on numpy typecodes (a sketch of what I mean is in the P.S. below). Do you mean that, or do you mean actual PyArray_Descr objects? Those I'm not using, and I do not see how to use them, likely because I simply do not understand what they actually are :-). In that department, I'd really, really appreciate an explanation!

Cheers,
Evgeni
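P.S. For concreteness, here is roughly what I mean by "dispatching on typecodes" (a simplified sketch, not sparr's actual code; the kernel table is hypothetical). The np.dtype instances -- which I gather are the Python-level face of PyArray_Descr -- get reduced to their .char, and numpy's promotion rules are reused via np.promote_types:

    import numpy as np

    # map one-character typecodes to (hypothetical) low-level kernels
    _ADD_KERNELS = {
        'd': lambda a, b: a + b,    # a float64 kernel would live here
        'D': lambda a, b: a + b,    # a complex128 kernel would live here
    }

    def add(a, b):
        # reuse numpy's casting rules instead of a hand-written table
        out_dtype = np.promote_types(a.dtype, b.dtype)
        kernel = _ADD_KERNELS[out_dtype.char]
        return kernel(a.astype(out_dtype), b.astype(out_dtype))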