On Mon, Mar 28, 2016 at 3:29 AM, Evgeni Burovski <evgeny.burovskiy@gmail.com> wrote:
First and foremost, I'd like to gauge interest in the community ;-).
Does it actually make sense? Would you use such a data structure? What is
missing in the current version?

This looks awesome, and makes complete sense to me! In particular, xarray could really use an n-dimensional sparse structure.

A few other small things I'd like to see:
- Support for slicing, even if it's expensive.
- A strict way to set the shape without automatic expansion, if desired (e.g., if shape is provided in the constructor).
- Default to the dtype of the fill_value. NumPy does this for np.full.
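  For reference, recent NumPy infers the dtype of np.full from the fill value when no dtype is passed, e.g.:

      import numpy as np

      np.full((3, 3), fill_value=0).dtype     # default integer, taken from the fill value
      np.full((3, 3), fill_value=0.0).dtype   # float64
      np.full((3, 3), fill_value=0, dtype=np.float32).dtype   # an explicit dtype still wins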
 
Short to medium term, some issues I see are:

* 32-bit vs 64-bit indices. Scipy sparse matrices switch between index types.
I wonder if this is purely for backwards compatibility. Naively, it seems to me
that a new class could just always use 64-bit indices, but this might
be too naive?
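
For reference, scipy.sparse picks 32-bit indices when they fit (which halves the index memory) and upcasts to 64-bit only when the matrix is big enough to need it:

    import scipy.sparse as sp

    m = sp.random(1000, 1000, density=0.01, format='csr')
    m.indices.dtype, m.indptr.dtype   # typically (int32, int32) while everything fits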

* Data types and casting rules. For now, I basically piggy-back on
numpy's rules.
There are several slightly different ones (numba has one?), and there might be
an opportunity to simplify the rules. OTOH, inventing one more subtly different
set of rules might be a bad idea.

Yes, please follow NumPy. 
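
Concretely, that could just mean deferring to np.result_type / np.promote_types for the output dtype of mixed-dtype operations:

    import numpy as np

    np.result_type(np.int32, np.float32)     # float64
    np.result_type(np.int8, np.uint8)        # int16
    np.promote_types(np.int64, np.float16)   # float64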

* "Object" dtype. So far, there isn't one. I wonder if it's needed or having
only numeric types would be enough.

This would be marginally useful -- eventually someone is going to want to store some strings in a sparse array, and NumPy doesn't handle this very well. Thus pandas, h5py and xarray all end up using dtype=object for variable length strings. (pandas/xarray even take the monstrous approach of using np.nan as a sentinel missing value.)
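
For illustration, the issue is NumPy's fixed-width string dtype versus the object fallback:

    import numpy as np

    np.array(['a', 'bbbb']).dtype                # '<U4': fixed width, truncates on assignment
    np.array(['a', 'bbbb'], dtype=object).dtype  # object: the fallback pandas/h5py/xarray use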
 
* Interoperation with numpy arrays and other sparse matrices. I guess
__numpy_ufunc__ would be *the* solution here, when available.
For now, I do something simple based on special-casing and __array_priority__.
Sparse matrices almost work, but there are glitches.

Yes, __array_priority__ is about the best we can do now.
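
A minimal sketch of the mechanism as I understand it (the class name and priority value are placeholders; scipy.sparse does something similar):

    import numpy as np

    class SparseSketch:
        # Anything above ndarray's default priority of 0.0 makes an expression
        # like `np.ones(3) * s` return NotImplemented from ndarray, so Python
        # falls through to our reflected method instead of densifying.
        __array_priority__ = 10.1

        def __rmul__(self, other):
            # hypothetical: scale the stored values and the fill value by `other`
            ...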

You could actually use a mix of __array_prepare__ and __array_wrap__ to make (non-generalized) ufuncs work, e.g., for functions like np.sin:

- In __array_prepare__, return the non-fill values of the array concatenated with the fill value.
- In __array_wrap__, reshape all but the last element to build a new sparse array, using the last element for the new fill value.

This would be a neat trick and get you most of what you could hope for from __numpy_ufunc__.
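
Here's a rough, untested sketch of that trick for a 1-D COO-style array (all names are made up, and I'm doing the input step in __array__, where the conversion to a plain array actually happens, rather than in __array_prepare__):

    import numpy as np

    class COOSketch:
        """Hypothetical 1-D sparse array: coords, stored values, and a fill value."""

        def __init__(self, coords, data, fill_value, shape):
            self.coords = np.asarray(coords)
            self.data = np.asarray(data)
            self.fill_value = fill_value
            self.shape = shape

        def __array__(self, dtype=None, copy=None):
            # The ufunc only ever sees the stored values plus the fill value.
            values = np.concatenate([self.data, [self.fill_value]])
            return values if dtype is None else values.astype(dtype)

        def __array_wrap__(self, result, context=None, return_scalar=False):
            # All but the last element become the new stored values; the last
            # element is the ufunc applied to the fill value.
            return COOSketch(self.coords, result[:-1], result[-1], self.shape)

    s = COOSketch(coords=[2, 5], data=[0.0, np.pi / 2], fill_value=0.0, shape=(10,))
    t = np.sin(s)   # t.data is approximately [0.0, 1.0]; t.fill_value == 0.0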