[Numpy-discussion] Help in speeding up accumulation in a matrix

Nicolas SCHEFFER scheffer.nicolas at gmail.com
Sat Jan 29 16:01:45 EST 2011


Hi all,

First email to the list for me, I just want to say how grateful I am
to have python+numpy+ipython etc... for my day to day needs. Great
combination of software.

Anyway, I've been fighting a bottleneck in one of my algorithms that
has been bugging me for quite a while.
The objective is to speed this part up. I've been doing tons of
optimization and parallel processing around that piece of code to get
a decent run time.

The problem is simple: accumulate into a matrix a weighted sum of
other matrices. Let's call this function scale_and_add:
def scale_and_add_re(R, w, Ms):
    # R += sum_i w[i] * Ms[i], accumulated in place
    nb_add = Ms.shape[0]
    for i in range(nb_add):
        R += w[i] * Ms[i]  # allocates a temporary w[i]*Ms[i] each iteration
    return R
This 'for' loop bugs me since I know it slows things down.
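
For reference, here's a minimal setup reproducing the shapes from the
timings below (the random data is purely hypothetical, just so the
benchmark runs):

import numpy as np

nb_add, mdim = 2000, 200
w = np.random.rand(nb_add)               # weights, shape (2000,)
Ms = np.random.rand(nb_add, mdim, mdim)  # stack of matrices, (2000, 200, 200)
R = np.zeros((mdim, mdim))               # accumulator, (200, 200)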

But the dimensions involved are so large that every attempt to
vectorize this has come out slower and taken too much memory.
I typically work with around 1000 weights and matrices, each matrix of
dimension several hundred.

My current config is:
In [2]: %timeit scale_and_add_re(R,w,Ms)
1 loops, best of 3: 392 ms per loop

In [3]: w.shape
Out[3]: (2000,)

In [4]: Ms.shape
Out[4]: (2000, 200, 200)
I'd like to be able to double these dimensions.
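
A quick back-of-envelope on memory, assuming float64 (halve the
numbers for float32):

# Ms dominates the footprint:
2000 * 200 * 200 * 8 / 1e9   # ~0.64 GB today
4000 * 400 * 400 * 8 / 1e9   # ~5.12 GB with every dimension doubled

So any scheme that materializes even a single temporary copy of Ms is
out of the question.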

For instance, I could vectorize the whole sum as a single dot product:
%timeit dot(Ms.T,w)
1 loops, best of 3: 1.77 s per loop
But this is (i) slower and (ii) takes too much memory.
(Btw, I'd really need an in-place dot product in numpy to avoid the
copy in memory, like the BLAS calls in scipy.linalg. But that's for
another thread...)
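
One variant that might be worth timing is np.tensordot: it contracts
along the leading axis of Ms, which (for a C-contiguous Ms) should let
it hand everything to BLAS in one call, without the big contiguous
copy that dot(Ms.T, w) likely triggers on the transposed view. A
sketch, not tested at these sizes:

import numpy as np

def scale_and_add_td(R, w, Ms):
    # contract w's only axis with Ms's first axis -> (mdim, mdim)
    R += np.tensordot(w, Ms, axes=(0, 0))
    return R

The only temporary here is the (mdim, mdim) result; Ms itself is never
copied.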

The matrices are square and symmetric. I feel I should be able to get
something out of that, but I've never found anything related to it in
NumPy.
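
I don't know of symmetric-storage support in NumPy either, but one way
to cash in on the symmetry is to accumulate only the upper triangle and
mirror it at the end. A rough sketch (the fancy indexing per iteration
has its own cost, so this needs timing; it also assumes R itself is
symmetric):

import numpy as np

def scale_and_add_sym(R, w, Ms):
    nb_add, mdim, _ = Ms.shape
    iu = np.triu_indices(mdim)      # upper triangle incl. diagonal
    acc = np.zeros(len(iu[0]))
    for i in range(nb_add):
        acc += w[i] * Ms[i][iu]     # touch only ~half of each matrix
    R[iu] += acc
    il = np.tril_indices(mdim, -1)  # strictly lower triangle
    R[il] = R.T[il]                 # mirror the upper result into it
    return R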

I also tried a Cython reimplementation:
%timeit scale_and_add_reg(L1,w,Ms)
1 loops, best of 3: 393 ms per loop
It brought no speedup.

Here's the code (with the cimports it needs):

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
def scale_and_add_reg(np.ndarray[float, ndim=2] R,
                      np.ndarray[float, ndim=1] w,
                      np.ndarray[float, ndim=3] Ms):
    return _scale_and_add_reg(R, w, Ms)

@cython.boundscheck(False)
cdef int _scale_and_add_reg(np.ndarray[float, ndim=2] R,
                            np.ndarray[float, ndim=1] w,
                            np.ndarray[float, ndim=3] Ms):
    cdef int nb_add = Ms.shape[0]
    cdef unsigned int mdim = Ms.shape[1]
    cdef unsigned int i
    # R += w[i]*Ms[i] is still a Python-level NumPy call, so the typed
    # arguments buy nothing here -- hence the identical timing.
    for i from 0 <= i < nb_add:
        R += w[i] * Ms[i]
        #for j in range(mdim):
        #    for k in range(mdim):
        #        R[j, k] += w[i] * Ms[i, j, k]
    return 0
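
For what it's worth, the commented-out loops are where Cython could
actually pay off, but only fully typed: indices declared as C integers
and buffer indexing written R[j, k] rather than R[j][k] (the latter
creates an intermediate Python object per access). A sketch along those
lines, assuming float64 arrays (the buffer dtypes must match the
arrays' dtype exactly):

import numpy as np
cimport numpy as np
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def scale_and_add_typed(np.ndarray[np.float64_t, ndim=2] R,
                        np.ndarray[np.float64_t, ndim=1] w,
                        np.ndarray[np.float64_t, ndim=3] Ms):
    cdef Py_ssize_t i, j, k
    cdef Py_ssize_t nb_add = Ms.shape[0]
    cdef Py_ssize_t mdim = Ms.shape[1]
    cdef double wi
    for i in range(nb_add):
        wi = w[i]
        for j in range(mdim):
            for k in range(mdim):
                # plain C arithmetic, no NumPy temporaries
                R[j, k] += wi * Ms[i, j, k]
    return R

Keep in mind the operation streams all of Ms once per call, so it is
memory-bandwidth bound; the realistic win is removing the per-iteration
temporaries, not beating BLAS.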

So here's my question, if someone has time to answer it: Have I tried
everything possible? Should I move on and live with this bottleneck?
Or is there something else I didn't think of?

Thanks for reading, keep up the great work!

-n


