Minimal sparse array interface
Hi everyone, Several times now, I've needed exactly what is available in scipy.sparse but with an array interface, instead of a matrix interface. Having such an interface would also simplify matters for libraries such as NetworkX and scikit-learn. I am aware of pydata.sparse, but that seems like quite a bit more than is needed here (higher dimensions, optimizations via numba; I'm not clear on the linear algebra status). Would it make sense to add a minimal interface on top of the SciPy sparse matrices that exposes a NumPy array interface? Eventually that could become the default (although it probably implies a larger refactor of the underlying classes). Best regards, Stéfan
Hi Stefan, You're thinking something like a lightweight `csr_array` which is almost identical to the existing csr_matrix, but has a np.array-like multiplication? I think it'd be a nice thing to have! Several things to consider upfront: 1. What's the minimal subset of formats? Would CSR be enough, or some others are needed from the start? 2. Should this live in scipy.sparse or keep it in a separate repo in the scipy org for a while? 3.What about d != 2 ? E.g., does array[0, :] behave as a 1D numpy array. What about array[:, :, None]. Does it broadcast? (This can develop into quite a rabbit hole unless defined very clearly from the start I suspect.) 4. The interaction with scipy.sparse.linalg. Cheers, Evgeni On Thu, Jun 3, 2021 at 8:54 PM Stefan van der Walt <stefanv@berkeley.edu> wrote:
Hi everyone,
Several times now, I've needed exactly what is available in scipy.sparse but with an array interface, instead of a matrix interface. Having such an interface would also simplify matters for libraries such as NetworkX and scikit-learn.
I am aware of pydata.sparse, but that seems like quite a bit more than is needed here (higher dimensions, optimizations via numba; I'm not clear on the linear algebra status).
Would it make sense to add a minimal interface on top of the SciPy sparse matrices that exposes a NumPy array interface? Eventually that could become the default (although it probably implies a larger refactor of the underlying classes).
Best regards, Stéfan _______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
Hi Evgeni, On Thu, Jun 3, 2021, at 12:34, Evgeni Burovski wrote:
You're thinking something like a lightweight `csr_array` which is almost identical to the existing csr_matrix, but has a np.array-like multiplication?
Exactly, and that also conforms with indexing expectations.
1. What's the minimal subset of formats? Would CSR be enough, or some others are needed from the start?
I suspect we will have to cover most of them in order for this to be useful. I just looked at NetworkX, for example, and they use CSR, LIL, and COO at least.
2. Should this live in scipy.sparse or keep it in a separate repo in the scipy org for a while?
Most of it we can probably build separately, but it may require a few tweaks to the existing classes. Those changes could simply be PRs (once there is rough consensus that we want to go this route).
3.What about d != 2 ? E.g., does array[0, :] behave as a 1D numpy array. What about array[:, :, None]. Does it broadcast? (This can develop into quite a rabbit hole unless defined very clearly from the start I suspect.)
Indexing is probably the trickiest part to get right. A first thought would be to limit any interactions that produce >2D, and to introduce 1d and 0d. I'd have to double check, but it may be possible to do shape calculations, follow that by relying on existing operations to produce a result, and then adjust that result to conform to the expected shape.
4. The interaction with scipy.sparse.linalg.
We'd need some way to let `scipy.sparse.linalg` know what's going on. One way would be to have `linalg` call `as_sparse_matrix` or `as_sparse_array` explicitly, depending on its needs. Stéfan
I wrote a lightweight sparse ndarray library a few years ago: https://github.com/perimosocordiae/sparray It uses Cython for some hot loops, otherwise it's fairly simple. It uses a flattened COO-like format that's pretty efficient for common ndarray use cases. I was planning to add other backends, but then I finished my degree and got a Real Job. Around the same time, Evgeni was experimenting with https://github.com/ev-br/sparr, which uses a generalized DOK format for sparse ndarrays. Both of these haven't seen development or use in a while, but if you're looking to put a sparse ndarray together they might be useful starting points. On Thu, Jun 3, 2021 at 4:03 PM Stefan van der Walt <stefanv@berkeley.edu> wrote:
Hi Evgeni,
On Thu, Jun 3, 2021, at 12:34, Evgeni Burovski wrote:
You're thinking something like a lightweight `csr_array` which is almost identical to the existing csr_matrix, but has a np.array-like multiplication?
Exactly, and that also conforms with indexing expectations.
1. What's the minimal subset of formats? Would CSR be enough, or some others are needed from the start?
I suspect we will have to cover most of them in order for this to be useful. I just looked at NetworkX, for example, and they use CSR, LIL, and COO at least.
2. Should this live in scipy.sparse or keep it in a separate repo in the scipy org for a while?
Most of it we can probably build separately, but it may require a few tweaks to the existing classes. Those changes could simply be PRs (once there is rough consensus that we want to go this route).
3.What about d != 2 ? E.g., does array[0, :] behave as a 1D numpy array. What about array[:, :, None]. Does it broadcast? (This can develop into quite a rabbit hole unless defined very clearly from the start I suspect.)
Indexing is probably the trickiest part to get right. A first thought would be to limit any interactions that produce >2D, and to introduce 1d and 0d. I'd have to double check, but it may be possible to do shape calculations, follow that by relying on existing operations to produce a result, and then adjust that result to conform to the expected shape.
4. The interaction with scipy.sparse.linalg.
We'd need some way to let `scipy.sparse.linalg` know what's going on. One way would be to have `linalg` call `as_sparse_matrix` or `as_sparse_array` explicitly, depending on its needs.
Stéfan _______________________________________________ SciPy-Dev mailing list SciPy-Dev@python.org https://mail.python.org/mailman/listinfo/scipy-dev
HI CJ, On Thu, Jun 3, 2021, at 21:39, CJ Carey wrote:
I wrote a lightweight sparse ndarray library a few years ago: https://github.com/perimosocordiae/sparray
It uses Cython for some hot loops, otherwise it's fairly simple. It uses a flattened COO-like format that's pretty efficient for common ndarray use cases. I was planning to add other backends, but then I finished my degree and got a Real Job.
I presume sparray does not work with the linear algebra facilities in SciPy?
Both of these haven't seen development or use in a while, but if you're looking to put a sparse ndarray together they might be useful starting points.
I'll definitely look at the indexing logic in particular, thanks! Stéfan
On Fri, Jun 4, 2021 at 3:51 PM Stefan van der Walt <stefanv@berkeley.edu> wrote:
HI CJ,
On Thu, Jun 3, 2021, at 21:39, CJ Carey wrote:
I wrote a lightweight sparse ndarray library a few years ago: https://github.com/perimosocordiae/sparray
It uses Cython for some hot loops, otherwise it's fairly simple. It uses a flattened COO-like format that's pretty efficient for common ndarray use cases. I was planning to add other backends, but then I finished my degree and got a Real Job.
I presume sparray does not work with the linear algebra facilities in SciPy?
I haven't tested for scipy.linalg compatibility specifically. I expect we might need something similar to the existing is_pydata_spmatrix() check: https://github.com/scipy/scipy/blob/f89d39901af543a1a656585f8b8f3ec059788e1e... That said, I just picked an example at random to test: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.exp... And here's the result: In [1]: from scipy.sparse import csc_matrix In [2]: from scipy.sparse.linalg import expm, expm_multiply In [3]: A = csc_matrix([[1, 0], [0, 1]]) In [4]: B = np.array([np.exp(-1.), np.exp(-2.)]) In [5]: expm_multiply(A, B, start=1, stop=2, num=3, endpoint=True) Out[5]: array([[1. , 0.36788], [1.64872, 0.60653], [2.71828, 1. ]]) In [6]: import sparray In [7]: A2 = sparray.FlatSparray.from_spmatrix(A) In [8]: expm_multiply(A2, B, start=1, stop=2, num=3, endpoint=True) Out[8]: array([[1. , 0.36788], [1.64872, 0.60653], [2.71828, 1. ]])
On Fri, Jun 4, 2021, at 13:53, CJ Carey wrote:
That said, I just picked an example at random to test: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.exp...
This may be the way to guarantee that we do it right: parameterize the test suite, and test it with both types of objects. Stéfan
participants (3)
-
CJ Carey -
Evgeni Burovski -
Stefan van der Walt