Which sparse matrix package?

eric eric at ericaro.net
Thu Dec 18 18:42:10 EST 2008


On Dec 18, 11:52 pm, MRAB <goo... at mrabarnett.plus.com> wrote:
> James Mills wrote:
> > On Fri, Dec 19, 2008 at 8:18 AM, Martin Manns <mma... at gmx.net> wrote:
> >> Hi:
>
> > Hi,
>
> >> I am writing a spreadsheet application in Python
>
> > What's wrong with pyspread ?
>
> > [ ... snip ... ]
>
> >> The dict that I tried out is of the type:
>
> >> {(1,2,3): "2323", (1,2,545): "2324234", ... }
>
> >> It is too slow for my application when it grows. One slicing operation
> >> with list comprehensions takes about 1/2 s on my computer for 1E6
> >> elements.
>
> > Let me get this straight.
> > It's taking 0.5s to slice your matrix
> > of 1E7 (10000000.0 rows/columns)
>
> > Are you mad ? This is TEN Millions and you
> > required it faster than 0.5s ?
>
> No, 1E6, or ONLY 1 million. :-)

What does Internet Explorer 6 & 7 have to do with that? ;-)

>
> > Am I missing something here ?
>
> Would a matrix of matrices work better, ie a supermatrix where each
> element is either a submatrix of cells or None if the submatrix is
> empty? How much memory it would save would depend on how sparse the
> matrix is.

Sparse matrix efficiency is all about guessing the probability for a
cell to be empty. My guess is that the probability for a cell (i,j) to
be empty grows with norm(i,j).

Then you should store your data in one list and the indexes in another
"list", and either use a BTree-like algorithm or keep your data ordered
by growing norm(i,j).

something like:

M = 1 << 16  # Bloom filter size (bit array held in a single Python int)
bloom = 0    # bit k set => some cell hashing to k has been written

data = []    # cell values, kept in the same order as kernel
kernel = []  # full of couples (i,j), one per non-empty cell

def add(i, j, value):
    global bloom
    data.append(value)      # or put in the "right place" to keep it sorted
    kernel.append((i, j))   # same comment
    bloom |= 1 << ((i * j) % M)

def kernel_of(i, j):
    # linear scan here; with kernel kept sorted this becomes a dichotomy
    try:
        return kernel.index((i, j))
    except ValueError:
        return None

def is_probably_empty(i, j):
    # single-hash Bloom test with i*j % M (see below): "empty" answers
    # are always right, "non-empty" may be a false positive
    return not (bloom >> ((i * j) % M)) & 1

def get(i, j):
    k = kernel_of(i, j)
    return data[k] if k is not None else None

def set(i, j, value):
    if is_probably_empty(i, j):
        add(i, j, value)        # surely empty: no search needed
    else:
        k = kernel_of(i, j)     # maybe occupied: search first
        if k is not None:
            data[k] = value
        else:
            add(i, j, value)    # false positive: just add

where the kernel_of function can take advantage of the sorted order
(i*j or max(i,j) should work just as well as a sort key) to do a fast
dichotomy (binary search).
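
Something like this for the dichotomy, assuming the kernel is kept
sorted; the sort_key and the function names below are only illustrative:

import bisect

def sort_key(i, j):
    # any total order works; (max(i, j), i, j) is one choice that keeps
    # cells of small "norm" near the front
    return (max(i, j), i, j)

def add_sorted(kernel, data, i, j, value):
    # kernel holds sort keys in ascending order, data the matching values
    k = bisect.bisect_left(kernel, sort_key(i, j))
    kernel.insert(k, sort_key(i, j))
    data.insert(k, value)

def kernel_of_sorted(kernel, i, j):
    # O(log n) lookup instead of a linear scan
    k = bisect.bisect_left(kernel, sort_key(i, j))
    if k < len(kernel) and kernel[k] == sort_key(i, j):
        return k
    return None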

The set function is about finding out whether the cell is empty or not,
and then doing either an index lookup or an add. To know efficiently
whether a cell is empty (the is_probably_empty test), try a Bloom
filter, using i*j % m as the hash function (for instance).
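
A quick check of the sketch above, with made-up cell values, just to
show how a false positive on the i*j hash is handled:

set(3, 4, "2323")      # surely empty: added directly
set(3, 4, "updated")   # bit for 3*4 already set: searched, updated in place
set(2, 6, "xyz")       # 2*6 == 12 as well: false positive, searched, added
assert get(3, 4) == "updated"
assert get(2, 6) == "xyz"
assert get(9, 9) is None   # untouched cell reads back as empty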

Here the arrays have the size of your actual data (maybe thousands of
cells for a huge spreadsheet), which is MUCH better than maxN*maxM.

If my memory is correct, scipy (built on top of numpy) already ships
sparse matrix types in scipy.sparse; you should try those against your
own sparse matrix implementation.





--
Eric
http://codeslash.blogspot.com


