[SciPy-User] Sparse vector

Francesc Alted faltet at pytables.org
Fri Apr 23 13:57:24 EDT 2010


On Wednesday 21 April 2010 23:01:14, Felix Schlesinger wrote:
> Storing indices separately is a good option sometimes, but it does not
>  allow random access. Are you aware of any fast hash-table implementation
>  that can store atomic values (e.g. numpy floats) instead of general Python
>  objects? That would go a long way towards sparse vectors.

What I'd suggest is not exactly a hash-table implementation, but rather using 
compression for keeping your sparse vectors (or arrays).  Then use a package 
that can handle typical NumPy math operations and slicing, but with 
compressed containers instead of ndarrays, and you will have something very 
close to what you are looking for.

Here is a simple example with PyTables, showing how you can deal with this 
efficiently:

In [1]: import numpy as np
In [2]: import tables as tb

# Create a sparse vector of length N (you can use matrices too)
In [4]: N = 100*1000*1000
In [5]: a0 = np.zeros(N, dtype='i4')
# Fill a0 with some sparse values (~1% fill ratio)
In [6]: l = np.random.randint(0, N, N/100).astype('i4')
In [7]: a0[l] = 2*l+3

# Save it in a file using compression
In [9]: f = tb.openFile("/tmp/vectors.h5", "w")
# Use the highly efficient Blosc compressor (http://blosc.pytables.org)
In [10]: filters = tb.Filters(complib='blosc', complevel=9, shuffle=0)
In [11]: v0 = f.createCArray(f.root, 'v0', tb.Int32Atom(), shape=(N,), filters=filters)
In [12]: v0[:] = a0
In [13]: a0    # a0 is a typical ndarray container (data is in-memory)
Out[13]: array([0, 0, 0, ..., 0, 0, 0], dtype=int32)
In [14]: v0    # v0 is a CArray container (data is on-disk)
Out[14]:
/v0 (CArray(100000000,), blosc(9)) ''
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (32768,)

# Show the sizes of a0 (in-memory) and v0 (on-disk)
In [15]: a0.size*a0.itemsize
Out[15]: 400000000    # ~ 400 MB
In [16]: !ls -lh /tmp/vectors.h5
-rw-r--r-- 1 faltet users 8,2M 23 abr 19:15 /tmp/vectors.h5
# so, the compressed vector v0 takes ~ 8 MB on disk,
# while ndarray a0 takes ~ 400 MB
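# In case it is handy, the same ratio can be computed from within Python.
# A small sketch (using os.path.getsize on the file just written; flush()
# first so everything is actually on disk):

import os

f.flush()  # make sure all buffered data has been written to disk
nbytes_mem = a0.size * a0.itemsize
nbytes_disk = os.path.getsize("/tmp/vectors.h5")
print("compression ratio: ~%.0fx" % (float(nbytes_mem) / nbytes_disk))
# roughly 400 MB / 8.2 MB, i.e. ~50x for the random data above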

# Now, run some slicing operations on v0
# Single element
In [19]: timeit -r1 -n1 v0[1000]
1 loops, best of 1: 129 µs per loop
# Slice
In [20]: timeit -r1 -n1 v0[1000:10000]
1 loops, best of 1: 162 µs per loop
# Fancy indexing
In [21]: timeit -r1 -n1 v0[np.random.randint(0, N, 10000)]
1 loops, best of 1: 165 ms per loop

# Of course, ndarray indexing is way faster
In [22]: timeit -r1 -n1 a0[1000]
1 loops, best of 1: 4.05 µs per loop
In [23]: timeit -r1 -n1 a0[1000:10000]
1 loops, best of 1: 5.01 µs per loop
In [24]: timeit -r1 -n1 a0[np.random.randint(0, N, 10000)]
1 loops, best of 1: 1.39 ms per loop

# But PyTables lets you operate on arrays on disk, no matter how large they are
# Firstly, create another vector both in-memory and on-disk
In [25]: a1 = a0/2  # a1 is a new ndarray in-memory
In [26]: v1 = f.createCArray(f.root, 'v1', tb.Int32Atom(), shape=(N,), filters=filters)
In [27]: v1[:] = a1  # dump a1 ndarray into v1 CArray on disk

# Now, perform an arithmetic operation completely on-disk
In [29]: expr = tb.Expr('3*v1-2*v0')  # the expression to compute
# Create an array for output
In [31]: r = f.createCArray(f.root, 'result', tb.Int32Atom(), (N,), filters=filters)
In [32]: expr.setOutput(r)  # the output for expression will be 'r'
# Perform the computation!
In [33]: timeit -r1 -n1 expr.eval()
1 loops, best of 1: 1.65 s per loop
# NumPy is still faster, but not by a great deal
In [34]: timeit -r1 -n1 3*a1-2*a0
1 loops, best of 1: 1.09 s per loop

[You will need version 2.2b3 of PyTables (or trunk) to do this]

The advantage of the CArray objects in PyTables is that you can have as many 
of them as you want, the only limitation being your available disk space. 
Also, as the CArray containers are compressed, the OS filesystem cache can 
effectively hold many hundreds of GB of sparse data simultaneously, which 
results in much better performance during array computations (almost reaching 
pure NumPy speed).
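
For instance, here is a minimal sketch of picking up the same vectors in a 
later session (assuming the file and node names created above):

import tables as tb

# Re-open the file in read-only mode; nothing is loaded into memory yet
f = tb.openFile("/tmp/vectors.h5", "r")
v0 = f.root.v0        # natural naming: nodes hang from the root group
r = f.root.result
print(v0[1000:1010])  # only the chunks touched by the slice are decompressed
print(r[:10])
f.close()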

Not exactly a sparse matrix implementation, but it could make your day.
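
And regarding the random access in your original question: element-wise reads 
and writes do work directly on the compressed container. A minimal sketch 
(assuming numpy imported as np, as above):

import numpy as np
import tables as tb

# Open in append mode so the existing CArray can be modified in-place
f = tb.openFile("/tmp/vectors.h5", "a")
v0 = f.root.v0
idx = np.random.randint(0, v0.nrows)  # pick a random index
v0[idx] = 2*idx + 3    # random write: only one chunk is recompressed
assert v0[idx] == 2*idx + 3   # random read: only one chunk is decompressed
f.close()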

-- 
Francesc Alted
