RFC: sparse DOK array

Hi,

TL;DR: Here's a draft implementation of a 'sparse array', suitable for incremental assembly. Comments welcome! https://github.com/ev-br/sparr

Scipy sparse matrices are generally great. One area where they are not very great is the formats suitable for incremental assembly (especially LIL and DOK matrices). This has been discussed for a while; see, for instance, https://github.com/scipy/scipy/pull/2908#discussion_r16955546

I spent some time to see how hard it is to build a dictionary-of-keys data structure without Python overhead. Turns out, not very hard.

Also, when constructing a sparse data structure now, it seems to make sense to make it a sparse *array* rather than a *matrix*, so that it supports both elementwise and matrix multiplication.

A draft implementation is available at https://github.com/ev-br/sparr and I'd be grateful for any comments, large or small. Some usage examples are in the README file, which GitHub shows on the repository front page.

First and foremost, I'd like to gauge interest in the community ;-). Does it actually make sense? Would you use such a data structure? What is missing in the current version?

Short to medium term, some issues I see are:

* 32-bit vs 64-bit indices. Scipy sparse matrices switch between index types. I wonder if this is purely backwards compatibility. Naively, it seems to me that a new class could just always use 64-bit indices, but this might be too naive?

* Data types and casting rules. For now, I basically piggy-back on numpy's rules. There are several slightly different ones (numba has one?), and there might be an opportunity to simplify the rules. OTOH, inventing one more subtly different set of rules might be a bad idea.

* "Object" dtype. So far, there isn't one. I wonder if it's needed or having only numeric types would be enough.

* Interoperation with numpy arrays and other sparse matrices. I guess __numpy_ufunc__ would be *the* solution here, when available. For now, I do something simple based on special-casing and __array_priority__. Sparse matrices almost work, but there are glitches.

This is just a non-exhaustive list of things I'm thinking about. If there's something else (and I'm sure there is plenty!) I'm all ears --- either in this email thread or in the issue tracker. If someone feels like commenting on implementation details, sure: none of those is cast in stone, and the plumbing is definitely in a bit of flux.

Cheers,
Evgeni
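[To make the assembly pattern concrete, here is what incremental assembly looks like with scipy's existing dok_matrix; every element write goes through Python-level dict machinery, which is the overhead the proposal targets. This snippet only illustrates the use case and does not use the sparr API.]

    import numpy as np
    from scipy.sparse import dok_matrix

    # Incremental assembly: entries arrive one at a time, e.g. while looping
    # over finite elements or graph edges.
    n = 1000
    m = dok_matrix((n, n))
    rng = np.random.RandomState(0)
    for _ in range(5000):
        i, j = rng.randint(n, size=2)
        m[i, j] += 1.0        # each write goes through a Python dict + index checks

    csr = m.tocsr()           # convert once assembly is done, then do the math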

On Mon, Mar 28, 2016 at 3:29 AM, Evgeni Burovski <evgeny.burovskiy@gmail.com> wrote:
First and foremost, I'd like to gauge interest in the community ;-). Does it actually make sense? Would you use such a data structure? What is missing in the current version?
This looks awesome, and makes complete sense to me! In particular, xarray could really use an n-dimensional sparse structure.

A few other small things I'd like to see:

- Support for slicing, even if it's expensive.
- A strict way to set the shape without automatic expansion, if desired (e.g., if shape is provided in the constructor).
- Default to the dtype of the fill_value. NumPy does this for np.full.
Short to medium term, some issues I see are:
* 32-bit vs 64-bit indices. Scipy sparse matrices switch between index types. I wonder if this is purely backwards compatibility. Naively, it seems to me that a new class could just always use 64-bit indices, but this might be too naive?
* Data types and casting rules. For now, I basically piggy-back on numpy's rules. There are several slightly different ones (numba has one?), and there might be an opportunity to simplify the rules. OTOH, inventing one more subtly different set of rules might be a bad idea.
Yes, please follow NumPy.

* "Object" dtype. So far, there isn't one. I wonder if it's needed or having only numeric types would be enough.
This would be marginally useful -- eventually someone is going to want to store some strings in a sparse array, and NumPy doesn't handle this very well. Thus pandas, h5py and xarray all end up using dtype=object for variable length strings. (pandas/xarray even take the monstrous approach of using np.nan as a sentinel missing value.)
* Interoperation with numpy arrays and other sparse matrices. I guess __numpy_ufunc__ would be *the* solution here, when available. For now, I do something simple based on special-casing and __array_priority__. Sparse matrices almost work, but there are glitches.
Yes, __array_priority__ is about the best we can do now.

You could actually use a mix of __array_prepare__ and __array_wrap__ to make (non-generalized) ufuncs work, e.g., for functions like np.sin:

- In __array_prepare__, return the non-fill values of the array concatenated with the fill value.
- In __array_wrap__, reshape all but the last element to build a new sparse array, using the last element for the new fill value.

This would be a neat trick and get you most of what you could hope for from __numpy_ufunc__.
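[A minimal, self-contained sketch of this trick for a toy 1-D container; class and attribute names below are made up, not from sparr. In this sketch the dense staging vector is produced by __array__ rather than __array_prepare__, and __array_wrap__ rebuilds the sparse result as described above.]

    import numpy as np

    class ToySparseVec(object):
        # Toy 1-D sparse vector: {index: value} entries plus a fill value.
        def __init__(self, data, fill_value, size):
            self.data = dict(data)
            self.fill_value = fill_value
            self.size = size

        def __array__(self, dtype=None):
            # Dense staging vector handed to the ufunc: the non-fill values
            # concatenated with the fill value as the last element.
            vals = list(self.data.values()) + [self.fill_value]
            return np.asarray(vals, dtype=dtype)

        def __array_wrap__(self, out, context=None):
            # All but the last element are the new non-fill values; the last
            # element becomes the new fill value.
            new_data = dict(zip(self.data.keys(), out[:-1]))
            return ToySparseVec(new_data, fill_value=out[-1], size=self.size)

    v = ToySparseVec({0: 0.5, 3: 1.5}, fill_value=0.0, size=5)
    w = np.sin(v)   # a ToySparseVec again, with fill_value == sin(0.0) == 0.0

np.sin converts v through __array__, applies the ufunc to the staging vector, and hands the raw result to v.__array_wrap__, so the stored values and the fill value are transformed in a single pass.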

Thanks Stephan,
A few other small things I'd like to see: - Support for slicing, even if it's expensive.
Slicing is on the TODO list. Only needs a bit of plumbing work.
- A strict way to set the shape without automatic expansion, if desired (e.g., if shape is provided in the constructor).
You can set the initial shape in the constructor. Or do you mean a flag to freeze the shape once it's set and have `__setitem__(out-of-bounds)` raise an error?
- Default to the dtype of the fill_value. NumPy does this for np.full.
Thanks for the suggestion --- done and implemented!
* Data types and casting rules. For now, I basically piggy-back on numpy's rules. There are several slightly different ones (numba has one?), and there might be an opportunity to simplify the rules. OTOH, inventing one more subtly different set of rules might be a bad idea.
Yes, please follow NumPy.
One thing I'm wondering about is the numpy rule that scalars never upcast arrays. Is it something people actually rely on? [In my experience, I only had to work around it, but my experience might be singular.]
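[For reference, the rule in question, in two lines:]

    import numpy as np

    a = np.ones(3, dtype=np.float32)
    print((a * 2.0).dtype)          # float32 -- the Python scalar 2.0 does not upcast the array
    print((a * np.ones(3)).dtype)   # float64 -- another array does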
You could actually use a mix of __array_prepare__ and __array_wrap__ to make (non-generalized) ufuncs work, e.g., for functions like np.sin:
- In __array_prepare__, return the non-fill values of the array concatenated with the fill value.
- In __array_wrap__, reshape all but the last element to build a new sparse array, using the last element for the new fill value.
This would be a neat trick and get you most of what you could hope for from __numpy_ufunc__.
This is really neat indeed! I've flagged it in https://github.com/ev-br/sparr/issues/35

At the moment, I'm dealing with something much less cool, and I suspect I'm not the first one to hit it: given m, a MapArray, and csr, a scipy.sparse matrix,

- m * csr produces a MapArray holding the result of the elementwise multiplication, but
- csr * m fails with a dimension mismatch error when the dimensions are OK for elementwise multiplication but not for matrix multiplication.

The failure is somewhere in the scipy.sparse code. I tried playing with __array_priority__, but so far I have not managed to convince scipy.sparse matrices to defer cleanly to the right-hand operand (with MapArray as the left-hand operand, things are OK).

Cheers,
Evgeni

For completeness I'd mention an alternative: https://github.com/scipy/scipy/pull/6004 This is a completely independent effort framed better for the scipy.sparse framework.

Another alternative: https://github.com/perimosocordiae/sparray

I tried to follow the numpy API as closely as I could, to see if something like this could work. The code isn't particularly optimized, but it does beat CSR/CSC spmatrix formats on a few benchmarks. My plan going forward is to have swappable (opaque) backends to allow for tradeoffs in speed/memory for access/construction/math operations.

Apologies for the lacking documentation! The only good listing of current capabilities at the moment is the test suite (https://github.com/perimosocordiae/sparray/tree/master/sparray/tests).

On Mon, Mar 28, 2016 at 11:19 AM, Evgeni Burovski <evgeny.burovskiy@gmail.com> wrote:
For completeness I'd mention an alternative: https://github.com/scipy/scipy/pull/6004 This is a completely independent effort framed better for the scipy.sparse framework.

On Mon, Mar 28, 2016 at 5:37 PM, CJ Carey <perimosocordiae@gmail.com> wrote:
Another alternative: https://github.com/perimosocordiae/sparray
I tried to follow the numpy API as closely as I could, to see if something like this could work. The code isn't particularly optimized, but it does beat CSR/CSC spmatrix formats on a few benchmarks. My plan going forward is to have swappable (opaque) backends to allow for tradeoffs in speed/memory for access/construction/math operations.
Apologies for the lacking documentation! The only good listing of current capabilities at the moment is the test suite (https://github.com/perimosocordiae/sparray/tree/master/sparray/tests).
Wow, a great one from the Happy Valley!

I have quite a bit of catching up to do --- you are miles ahead of me in terms of feature completeness.

One difference I see is that you seem to be using a flat index. This likely does not play too well with assembly, which is what I tried to take into account.

Would you be interested in joining forces?

Cheers,
Evgeni

One difference I see is that you seem to be using a flat index. This likely does not play too well with assembly, which is what I tried to take into account.
Yeah, I took the easy route of using a single, sorted, flat index. This makes a lot of operations easy, especially with the help of np.ravel_multi_index and friends. You're correct that it makes assembly and other modifications to the sparsity structure difficult, though! For example: https://github.com/perimosocordiae/sparray/blob/master/sparray/base.py#L335
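[A small illustration of the flat-index approach (toy code, not taken from sparray): positions are encoded with np.ravel_multi_index and kept sorted, so random access is a binary search, while inserting a new nonzero means shifting the arrays.]

    import numpy as np

    shape = (4, 5)
    coords = np.array([[0, 1], [2, 3], [3, 4]])       # nonzero positions
    values = np.array([10.0, 20.0, 30.0])
    flat = np.ravel_multi_index(coords.T, shape)      # array([ 1, 13, 19])
    order = np.argsort(flat)
    flat, values = flat[order], values[order]

    # O(log n) random access via binary search:
    query = np.ravel_multi_index((2, 3), shape)       # 13
    i = np.searchsorted(flat, query)
    found = i < len(flat) and flat[i] == query        # True; values[i] == 20.0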
Would you be interested in joining forces?
I've been meaning to add multiple backends to work around that problem, and I expect that merging our efforts will make that possible. Let me know how you want to proceed. -CJ

On Mon, Mar 28, 2016 at 3:29 AM, Evgeni Burovski <evgeny.burovskiy@gmail.com> wrote:
First and foremost, I'd like to gauge interest in the community ;-). Does it actually make sense? Would you use such a data structure? What is missing in the current version?
This is really cool! The continuing importance of scipy.sparse is basically the only reason np.matrix hasn't been deprecated yet.

I've noodled around a little with designs for similar things at various points, so here are a few general comments based on that, which you can take or leave :-)

- For general operations on n-dimensional sparse arrays, COO format seems like the obvious thing to use. The strategy I've envisioned is storing the data as N arrays of indices + 1 array of values (or possibly you'd want to pack these together into a single array of (struct containing N+1 values)), plus a bit of metadata indicating the current sort order -- so sort_order=(0, 1) would mean that the items are sorted lexicographically by row and then by column (so like CSR), and sort_order=(1, 0) would mean the opposite. Then operations like sum(axis=1) would be "sort by axis 1 (if not already there); then sum over each contiguous chunk of the data"; sum(axis=(0, 2)) would be "sort to make sure axes 0 and 2 are in the front (but I don't care in which order)", etc. It may be useful to use a stable sort, so that if you have sort_order=(0, 1, 2) and then sort by axis 2, you get sort_order=(2, 0, 1). For binary operations, you ensure that the two arrays have the same sort order (possibly re-sorting the smaller one), and then do a parallel iteration over both of them finding matching indices.

- Instead of proliferating data types for all kinds of different storage formats, I'd really prefer to see a single data type that "just works". So for example, instead of having one storage format that's optimized for building matrices and one that's optimized for operations on matrices, you could have a single object which has:

  - COO format representation
  - a DOK of "pending updates"

  and writes would go into the DOK part, and at the beginning of each non-trivial operation you'd call self._get_coo_arrays() or whatever, which would do

      if self._pending_updates:
          # do a single batch reallocation and incorporate all the pending updates
      return self._real_coo_arrays

  There are a lot of potentially viable variants on this using different data structures for the various pieces, different detailed reallocation strategies, etc., but you get the idea.

For interface with external C/C++/Fortran code that wants CSR or CSC, I'd just have get_csr() and get_csc() methods that return a tuple of (indptr, indices, data), and not even try to implement any actual operations on this representation.
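[A compact sketch of this single-object design; all names are hypothetical, and this illustrates the idea above rather than code from sparr or sparray.]

    import numpy as np

    class COOWithBuffer(object):
        # n-D COO storage plus a DOK buffer of pending updates (illustrative;
        # a real implementation would also handle overwrites of existing entries).

        def __init__(self, shape, dtype=float):
            self.shape = tuple(shape)
            self.dtype = np.dtype(dtype)
            self._coords = np.empty((0, len(self.shape)), dtype=np.intp)
            self._values = np.empty(0, dtype=self.dtype)
            self._pending = {}   # {index tuple: value}

        def __setitem__(self, index, value):
            # Writes only touch the cheap dictionary buffer.
            self._pending[tuple(index)] = value

        def _get_coo_arrays(self):
            # Called at the start of every non-trivial operation: incorporate
            # all pending updates in a single batch reallocation.
            if self._pending:
                new_coords = np.array(list(self._pending.keys()), dtype=np.intp)
                new_values = np.array(list(self._pending.values()), dtype=self.dtype)
                self._coords = np.vstack([self._coords, new_coords])
                self._values = np.concatenate([self._values, new_values])
                self._pending.clear()
            return self._coords, self._values

        def sum(self, axis):
            # Group entries by the remaining axes and sum each group.
            coords, values = self._get_coo_arrays()
            keep = [a for a in range(len(self.shape)) if a != axis]
            keys, inverse = np.unique(coords[:, keep], axis=0, return_inverse=True)
            sums = np.zeros(len(keys), dtype=self.dtype)
            np.add.at(sums, inverse.ravel(), values)
            return keys, sums   # result kept in sparse (coords, values) form

    a = COOWithBuffer((3, 4))
    a[0, 1] = 2.0
    a[2, 1] = 3.0
    a[2, 3] = 1.0
    print(a.sum(axis=1))   # (array([[0], [2]]), array([2., 4.])) -- per-row totals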
* Data types and casting rules. For now, I basically piggy-back on numpy's rules. There are several slightly different ones (numba has one?), and there might be an opportunity to simplify the rules. OTOH, inventing one more subtly different set of rules might be a bad idea.
It would be really nice if you could just use numpy's own dtype objects and ufuncs instead of implementing your own. I don't know how viable that is right now, but I think it's worth trying, and if you can't then I'd like to know what part broke :-). Happy to help if you need some tips / further explanation.

-n

--
Nathaniel J. Smith -- https://vorpus.org

Hi Nathaniel,
- For general operations on n-dimensional sparse arrays, COO format seems like the obvious thing to use. The strategy I've envisioned is storing the data as N arrays of indices + 1 array of values (or possibly you'd want to pack these together into a single array of (struct containing N+1 values)), plus a bit of metadata indicating the current sort order -- so sort_order=(0, 1) would mean that the items are sorted lexicographically by row and then by column (so like CSR), and sort_order=(1, 0) would mean the opposite. Then operations like sum(axis=1) would be "sort by axis 1 (if not already there); then sum over each contiguous chunk of the data"; sum(axis=(0, 2)) would be "sort to make sure axes 0 and 2 are in the front (but I don't care in which order)", etc. It may be useful to use a stable sort, so that if you have sort_order=(0, 1, 2) and then sort by axis 2, you get sort_order=(2, 0, 1). For binary operations, you ensure that the two arrays have the same sort order (possibly re-sorting the smaller one), and then do a parallel iteration over both of them finding matching indices.
- Instead of proliferating data types for all kinds of different storage formats, I'd really prefer to see a single data type that "just works". So for example, instead of having one storage format that's optimized for building matrices and one that's optimized for operations on matrices, you could have a single object which has:
- COO format representation
- a DOK of "pending updates"
and writes would go into the DOK part, and at the beginning of each non-trivial operation you'd call self._get_coo_arrays() or whatever, which would do
if self._pending_updates:
    # do a single batch reallocation and incorporate all the pending updates
return self._real_coo_arrays
There are a lot of potentially viable variants on this using different data structures for the various pieces, different detailed reallocation strategies, etc., but you get the idea.
Thanks for sharing your perspective! I really like the idea of the DOK buffer of pending updates.

I'll note that the COO structure you describe looks similar to CJ's SpArray, which he mentioned in a parallel thread, and my MapArray can serve as a buffer. Tacking them together seems easy, so one can fairly easily test-drive what you describe.

Looking a bit further though, I doubt that a one-size-fits-all data format is realistic. For one, there's a balance to strike between an optimized 2D structure (e.g., PySparse's ll_mat) and a general n-D structure, and there will always be at least one application which requires one or the other. (My crude experiments so far give about a factor of 2--5 in CPU and/or memory footprint, depending on the tradeoffs.)

Ultimately then, I think it makes sense to consider three separate layers: a python frontend, a low-level data structure, and a middle layer which transforms, e.g., the logic required by python's __setitem__ into iterators of the low-level data structure. For shaping up this middle layer, an std::map is fine, especially because it's already implemented :-).
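[A purely illustrative sketch of that layering; every name here is made up. A Python frontend whose indexing logic plays the role of the middle layer, delegating storage to a swappable backend object that stands in for the C++ std::map.]

    class DictBackend(object):
        # Low-level store; a plain dict stands in for std::map.
        def __init__(self):
            self._d = {}

        def set(self, key, value):
            self._d[key] = value

        def get(self, key, default):
            return self._d.get(key, default)

        def items(self):
            return iter(self._d.items())

    class SparseFrontend(object):
        # Python frontend; the index handling below (bounds handling, shape
        # growth) is the "middle layer" that turns __setitem__/__getitem__
        # semantics into plain backend get/set calls.
        def __init__(self, shape, fill_value=0.0, backend=DictBackend):
            self.shape = tuple(shape)
            self.fill_value = fill_value
            self._backend = backend()

        def __setitem__(self, index, value):
            index = tuple(int(i) for i in index)
            # grow the shape on out-of-bounds writes, as in incremental assembly
            self.shape = tuple(max(s, i + 1) for s, i in zip(self.shape, index))
            self._backend.set(index, value)

        def __getitem__(self, index):
            return self._backend.get(tuple(index), self.fill_value)

    s = SparseFrontend((2, 2))
    s[5, 7] = 1.0
    print(s[5, 7], s[0, 0], s.shape)   # 1.0 0.0 (6, 8)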
For interface with external C/C++/Fortran code that wants CSR or CSC, I'd just have get_csr() and get_csc() methods that return a tuple of (indptr, indices, data), and not even try to implement any actual operations on this representation.
* Data types and casting rules. For now, I basically piggy-back on numpy's rules. There are several slightly different ones (numba has one?), and there might be an opportunity to simplify the rules. OTOH, inventing one more subtly different set of rules might be a bad idea.
It would be really nice if you could just use numpy's own dtype objects and ufuncs instead of implementing your own. I don't know how viable that is right now, but I think it's worth trying, and if you can't then I'd like to know what part broke :-). Happy to help if you need some tips / further explanation.
For ufuncs, it's simple: it does not work ATM :-(. I tried simple combinations of __array__, __array_prepare__ and __array_finalize__, and they fail for one reason or another. Some details are available in https://github.com/ev-br/sparr/issues/35 In a nutshell, these can probably be made to work with some rather simple modifications to __array_prepare__/__array_wrap__, but I guess that's not very likely given that __numpy_ufunc__ is supposed to come to town, which I'm looking forward to :-).

For dtypes, I currently just dispatch on numpy typecodes. Do you mean that, or do you mean actual PyArray_Descr objects? Those I'm not using, and I do not see how to use them, likely because I simply do not understand what they actually are :-). In that department, I'd really really appreciate an explanation!

Cheers,
Evgeni

On Thu, Apr 14, 2016 at 1:14 PM, Evgeni Burovski <evgeny.burovskiy@gmail.com> wrote:
Hi Nathaniel, [...] I really like the idea of the DOK buffer of pending updates. I'll note that the COO structure you describe looks similar to CJ's SpArray, which he mentioned in a parallel thread, and my MapArray can serve as a buffer. Tacking them together seems easy, so one can fairly easily test-drive what you describe.
Cool :-)
Looking a bit further though, I doubt that a one-size-fits-all data format is realistic. For one, there's a balance to strike between an optimized 2D structure (e.g., PySparse's ll_mat) and a general n-D structure, and there will always be at least one application which requires one or the other. (My crude experiments so far give about a factor of 2--5 in CPU and/or memory footprint, depending on the tradeoffs.)

Ultimately then, I think it makes sense to consider three separate layers: a python frontend, a low-level data structure, and a middle layer which transforms, e.g., the logic required by python's __setitem__ into iterators of the low-level data structure. For shaping up this middle layer, an std::map is fine, especially because it's already implemented :-).
I guess my tactical preference would be to start by implementing a one-size-fits-all data format, make it as awesome as one can, and then only add extra special cases once (if) it becomes clear that there is really a need.

ll_mat is just a list-of-lists format, right? Why would we need that?

And CSR/CSC are nice, but even for 2-d matrices the savings compared to COO seem underwhelming. I guess CSR/CSC are at most 30% smaller than COO, and if you need random access to rows/columns then you can do that in O(1) instead of O(log log n)? Obviously there will always be some specialized applications where this matters, but...
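[A back-of-the-envelope version of that size comparison, under assumed index and value widths; the exact saving depends on those choices.]

    # COO for a 2-D matrix stores (row, col, value) per nonzero; CSR stores
    # (col, value) per nonzero plus an indptr array of length nrows + 1.
    nnz, nrows = 10**6, 10**4
    idx, val = 8, 8                               # assume 8-byte indices and values
    coo = nnz * (2 * idx + val)                   # 24 bytes per nonzero
    csr = nnz * (idx + val) + (nrows + 1) * idx   # ~16 bytes per nonzero
    print(1.0 - float(csr) / coo)                 # ~0.33; with 4-byte indices, ~0.25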
For interface with external C/C++/Fortran code that wants CSR or CSC, I'd just have get_csr() and get_csc() methods that return a tuple of (indptr, indices, data), and not even try to implement any actual operations on this representation.
* Data types and casting rules. For now, I basically piggy-back on numpy's rules. There are several slightly different ones (numba has one?), and there might be an opportunity to simplify the rules. OTOH, inventing one more subtly different set of rules might be a bad idea.
It would be really nice if you could just use numpy's own dtype objects and ufuncs instead of implementing your own. I don't know how viable that is right now, but I think it's worth trying, and if you can't then I'd like to know what part broke :-). Happy to help if you need some tips / further explanation.
For ufuncs, it's simple: it does not work ATM :-(. I tried simple combinations of __array__, __array_prepare__ and __array_finalize__, and they fail for one reason or another. Some details are available in https://github.com/ev-br/sparr/issues/35 In a nutshell, these can probably be made to work with some rather simple modifications to __array_prepare__/__array_wrap__, but I guess that's not very likely given that __numpy_ufunc__ is supposed to come to town, which I'm looking forward to :-).
Yeah, you probably do need __numpy_ufunc__. But in the meantime you could see about something like

    def generic_ufunc_adapter(ufunc, *args, **kwargs):
        ... do something clever ...

    def add(*args, **kwargs):
        return generic_ufunc_adapter(np.add, *args, **kwargs)

and then be prepared to switch over to __numpy_ufunc__ when available. The big advantage of this approach is that you can hopefully handle all ufuncs, including user-defined ufuncs (e.g. scipy.special, numba @vectorize functions) without having to special case all of them.
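[One possible way to flesh out such an adapter for unary ufuncs, reusing the fill-value trick from earlier in the thread; nonfill_values and with_values below are hypothetical methods, not an existing API.]

    import numpy as np

    def generic_ufunc_adapter(ufunc, sparse, **kwargs):
        # Apply a unary ufunc to the stored values and to the fill value in
        # one shot, then rebuild a sparse result with the transformed fill value.
        dense = np.concatenate([sparse.nonfill_values, [sparse.fill_value]])
        out = ufunc(dense, **kwargs)
        return sparse.with_values(out[:-1], fill_value=out[-1])

    def sin(sparse):
        return generic_ufunc_adapter(np.sin, sparse)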
For dtypes, I currently just dispatch on numpy typecodes. Do you mean that, or do you mean actual PyArray_Descr objects? Those I'm not using, and I do not see how to use them, likely because I simply do not understand what they actually are :-). In that department, I'd really really appreciate an explanation!
I meant PyArray_Descr objects :-). They're a Python object whose C layout looks like this: https://docs.scipy.org/doc/numpy/reference/c-api.types-and-structures.html#c...

The crucial bit is that PyArray_ArrFuncs pointer at the end. That points to a table of function pointers: https://docs.scipy.org/doc/numpy/reference/c-api.types-and-structures.html#c...

and those function pointers implement generic operations like "copy elements of this dtype from this strided array to that strided array", "convert a Python object into a buffer using this dtype" (used by e.g. arr[0] = some_object), etc. They're not terribly well documented, and we need to rework that API (but will have to preserve those entrypoints for backcompat purposes), but this is how numpy.ndarray is able to work with arbitrary dtypes without dispatching on typecodes.

-n

--
Nathaniel J. Smith -- https://vorpus.org
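[From the Python side, those descriptors are what np.dtype returns, and numpy's promotion and casting machinery can be driven by them directly instead of by hand-rolled typecode tables; a small illustration:]

    import numpy as np

    dt = np.dtype('float32')
    print(dt.kind, dt.itemsize)                     # f 4

    # numpy's own promotion/casting rules, driven by dtype objects:
    print(np.promote_types(np.float32, np.int64))   # float64
    print(np.can_cast(np.int64, np.float32))        # False (not a safe cast)
    print(np.result_type(np.float32, 2))            # float32: the Python scalar does not upcast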