[SciPy-User] fancy indexation of sparse matrices is terrible slow

Tue Dec 15 08:32:47 EST 2009

2009/12/15 Dmitrey <tmp50 at ukr.net>:
>
> Because some optimization problems can have constraints gradient of size
> nVars*nConstraints very huge.  nVars are up to ~100'000, and nConstraints
> are up to 200'000'000. I will just have memory overflow if I will try
> casting it to dense each time.

I looked at the CSR fancy indexing code and realized it was much worse
than I originally thought so I implemented a faster path in C++ for
the operation.  With SciPy r6139 (
http://projects.scipy.org/scipy/changeset/6139 ) I get a considerable
improvement with a variation of your example:

from scipy.sparse import csr_matrix, find
from numpy import where, zeros, prod, vstack, ones
from time import time
n, m = 1000, 20000
M = vstack((zeros((n, m)), ones((25, m))))
I, J = where(M)

# 1: time for lil
M = csr_matrix(M)
print 'sparsity of the matrix:', float(M.size)/prod(M.shape)

t = time()
M[I, J]
print('time elapsed with csr: %f' % (time()-t))

# 2: time for dense
t = time()
M = M.A # even with this time elapsed is less
M[I, J]
print('time elapsed with ndarray: %f' % (time()-t))

# output:
# sparsity of the matrix: 0.0243902439024
# time elapsed with csr: 0.044210
# time elapsed with ndarray: 0.197938

> BTW, why return type of numpy.where is int64, but type of scipy.sparse.find
> is numpy.int32? I had spent lots of time to search why ipopt hadn't work
> with that data, and it turned out it is due to this type difference (maybe
> some issues related to connecting Python code to C++). Also, does it mean
> that I cannot create sparse matrices with one dimension greater than
> max(int32)  instead of max(int64)?

That's correct, max(int32) is the largest size you can use.

There are a few reasons for this limitation.  The first is that few
people will have matrices with more than 2B rows or columns since most
meaningful matrices of that size would require 100GB+ of memory.  The
second is that external libraries (SuperLU, UMFPACK, ARPACK, etc.)
don't necessarily support larger indices, or our wrappers for those
libraries don't necessarily support it right now.  Other SciPy
components, like io.savemat(), would also have to be checked as well.

The C++ backend of the sparse module (sparsetools) is fully templated
so adding support for 64-bit integers is fairly trivial there.
However, it would have the downside of doubling the compilation time
and executable size.

-- 
Nathan Bell wnbell at gmail.com
http://www.wnbell.com/