
On Mon, Mar 28, 2016 at 3:29 AM, Evgeni Burovski <evgeny.burovskiy@gmail.com> wrote:
Hi,
TL;DR: Here's a draft implementation of a 'sparse array', suitable for incremental assembly. Comments welcome! https://github.com/ev-br/sparr
Scipy sparse matrices are generally great. One area where they are not so great is the formats suitable for incremental assembly (especially LIL and DOK matrices). This has been discussed for a while; see, for instance, https://github.com/scipy/scipy/pull/2908#discussion_r16955546
I spent some time to see how hard it is to build a dictionary-of-keys data structure without Python overhead. Turns out, not very hard.
Also, when constructing a sparse data structure now, it seems to make sense to make it a sparse *array* rather than *matrix*, so that it supports both elementwise and matrix multiplication.
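[Editor's note: a quick illustration of the array-vs-matrix distinction being referred to here. For numpy arrays, `*` is elementwise and `@` is the matrix product, while np.matrix (like scipy.sparse matrices) overloads `*` as matrix multiplication:]

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
M = np.matrix(A)

elementwise = A * A   # ndarray: '*' multiplies elementwise
matmul = M * M        # np.matrix: '*' is the matrix product
at_product = A @ A    # arrays spell the matrix product with '@'
```

A sparse *array* would follow the first convention, so both kinds of product remain available.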
A draft implementation is available at https://github.com/ev-br/sparr and I'd be grateful for any comments, large or small.
Some usage examples are in the README file, which GitHub renders on the repository's front page.
First and foremost, I'd like to gauge interest in the community ;-). Does it actually make sense? Would you use such a data structure? What is missing in the current version?
This is really cool! The continuing importance of scipy.sparse is basically the only reason np.matrix hasn't been deprecated yet. I've noodled around a little with designs for similar things at various points, so a few general comments based on that, which you can take or leave :-)

- For general operations on n-dimensional sparse arrays, COO format seems like the obvious thing to use. The strategy I've envisioned is storing the data as N arrays of indices + 1 array of values (or possibly you'd want to pack these together into a single array of structs containing N+1 values each), plus a bit of metadata indicating the current sort order -- so sort_order=(0, 1) would mean that the items are sorted lexicographically by row and then by column (so like CSR), and sort_order=(1, 0) would mean the opposite (so like CSC). Then operations like sum(axis=1) would be "sort by axis 1 (if not already sorted that way); then sum over each contiguous chunk of the data"; sum(axis=(0, 2)) would be "sort to make sure axes 0 and 2 are in the front (but I don't care in which order)", etc. It may be useful to use a stable sort, so that if you have sort_order=(0, 1, 2) and then sort by axis 2, you get sort_order=(2, 0, 1). For binary operations, you ensure that the two arrays have the same sort order (possibly re-sorting the smaller one), and then do a parallel iteration over both of them, finding matching indices.

- Instead of proliferating data types for all kinds of different storage formats, I'd really prefer to see a single data type that "just works".
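[Editor's note: one way to realize the sort-then-reduce idea for axis reductions, sketched in plain numpy. This variant sorts by the *kept* axes so that the entries to be combined land in contiguous runs; the helper name is hypothetical, not sparr code:]

```python
import numpy as np

def coo_sum_axis(indices, values, axis):
    """Sum a COO-format sparse array over one axis.

    indices: (ndim, nnz) integer array of coordinates
    values:  (nnz,) array of the corresponding nonzero values
    Returns (indices, values) of the reduced array, still in COO form.
    """
    ndim = indices.shape[0]
    keep = [ax for ax in range(ndim) if ax != axis]
    # Sort lexicographically by the kept axes.  np.lexsort treats its
    # *last* key as the primary one, hence the reversed() below.
    order = np.lexsort([indices[ax] for ax in reversed(keep)])
    idx = indices[keep][:, order]
    vals = values[order]
    if vals.size == 0:
        return idx, vals
    # Starts of the contiguous runs that share all kept coordinates.
    change = np.any(idx[:, 1:] != idx[:, :-1], axis=0)
    starts = np.concatenate(([0], np.nonzero(change)[0] + 1))
    # Sum each contiguous chunk in a single vectorized call.
    return idx[:, starts], np.add.reduceat(vals, starts)
```

For example, a 2x2 matrix with nonzeros {(0,0): 1, (0,1): 2, (1,1): 3} summed over axis=1 gives row sums [3, 3].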
So, for example, instead of having one storage format that's optimized for building matrices and one that's optimized for operations on matrices, you could have a single object which has:

- a COO format representation, and
- a DOK of "pending updates".

Writes would go into the DOK part, and at the beginning of each non-trivial operation you'd call self._get_coo_arrays() or whatever, which would do

    if self._pending_updates:
        # do a single batch reallocation and incorporate all the pending updates
    return self._real_coo_arrays

There are a lot of potentially viable variants on this, using different data structures for the various pieces, different detailed reallocation strategies, etc., but you get the idea.

For interfacing with external C/C++/Fortran code that wants CSR or CSC, I'd just have get_csr() and get_csc() methods that return a tuple of (indptr, indices, data), and not even try to implement any actual operations on that representation.
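[Editor's note: a minimal sketch of that hybrid design. The class name and methods are illustrative, not sparr's actual API, and deduplication between the COO part and the pending dict is deliberately glossed over:]

```python
import numpy as np

class HybridSparse:
    """COO storage plus a dict of pending updates, flushed lazily."""

    def __init__(self, shape):
        self.shape = shape
        self._indices = np.empty((len(shape), 0), dtype=np.intp)
        self._values = np.empty(0)
        self._pending = {}          # {(i, j, ...): value}

    def __setitem__(self, index, value):
        # Writes are cheap: just a dict insertion, no reallocation.
        self._pending[index] = value

    def _get_coo_arrays(self):
        if self._pending:
            # Single batch reallocation incorporating all pending updates.
            # (A real implementation would also merge duplicate indices.)
            new_idx = np.array(list(self._pending), dtype=np.intp).T
            new_val = np.array(list(self._pending.values()))
            self._indices = np.concatenate([self._indices, new_idx], axis=1)
            self._values = np.concatenate([self._values, new_val])
            self._pending.clear()
        return self._indices, self._values

    def sum(self):
        # Every non-trivial operation flushes pending writes first.
        _, values = self._get_coo_arrays()
        return values.sum()
```

The point of the split is that repeated `a[i, j] = v` writes stay O(1) amortized, and the cost of growing the COO arrays is paid once per batch rather than once per element.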
* Data types and casting rules. For now, I basically piggy-back on numpy's rules. There are several slightly different ones (numba has one?), and there might be an opportunity to simplify the rules. OTOH, inventing one more subtly different set of rules might be a bad idea.
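[Editor's note: for what it's worth, numpy's promotion rules are directly queryable via np.result_type / np.promote_types, so piggy-backing can mean delegating to them rather than re-implementing the table; ufuncs then pick output dtypes consistently for free. A few examples of what those rules say:]

```python
import numpy as np

r1 = np.result_type(np.int8, np.int8)          # stays int8
r2 = np.result_type(np.int8, np.uint8)         # mixed signedness widens to int16
r3 = np.result_type(np.float32, np.int64)      # float64, to cover the int64 range
r4 = np.result_type(np.float32, np.complex64)  # complex64

# The ufunc machinery applies the same rules to outputs:
out = np.zeros(3, dtype=np.int32) * 2.5        # promoted to float64
```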
It would be really nice if you could just use numpy's own dtype objects and ufuncs instead of implementing your own. I don't know how viable that is right now, but I think it's worth trying, and if you can't, then I'd like to know what part broke :-). Happy to help if you need some tips / further explanation.

-n

--
Nathaniel J. Smith -- https://vorpus.org