# [Numpy-discussion] Help in speeding up accumulation in a matrix

eat e.antero.tammi at gmail.com
Sat Jan 29 17:56:10 EST 2011

```Hi,

On Sat, Jan 29, 2011 at 11:01 PM, Nicolas SCHEFFER <
scheffer.nicolas at gmail.com> wrote:

> Hi all,
>
> First email to the list for me, I just want to say how grateful I am
> to have python+numpy+ipython etc... for my day to day needs. Great
> combination of software.
>
> Anyway, I've been having this bottleneck in one my algorithms that has
> been bugging me for quite a while.
> The objective is to speed this part up. I've been doing tons of
> optimization and parallel processing around that piece of code to get
> a decent run time.
>
> The problem is easy. You want to accumulate in a matrix, a weighted
> sum of other matrices. Let's call this function scale_and_add:
>        R+=w[i]*Ms[i]
>    return R
> This 'for' loop bugs me since I know this will slow things down.
>
> But the dimension of these things are so large that any attempt to
> vectorize this is slower and takes too much memory.
> I typically work with 1000 weights and matrices, matrices of dimension
> of several hundred.
>
> My current config is:
> 1 loops, best of 3: 392 ms per loop
>
> In [3]: w.shape
> Out[3]: (2000,)
>
> In [4]: Ms.shape
> Out[4]: (2000, 200, 200)
> I'd like to be able to double these dimensions.

How this array Ms is created?  Do you really need to have it in the memory
as whole?

Assuming it's created by (200, 200) 'chunks' at a time, then you could
accumulate that right away to R. It still involves Python looping but that's

My 2 cents
eat
For instance I could use broadcasting by using a dot product
%timeit dot(Ms.T,w)
1 loops, best of 3: 1.77 s per loop
But this is i) slower ii) takes too much memory
(btw, I'd really need an inplace dot-product in numpy to avoid the
copy in memory, like the blas call in scipy.linalg. But that's for an

The matrices are squared and symmetric. I should be able to get
something out of this, but I never found anything related to this in
Numpy.

I also tried a Cython reimplementation
1 loops, best of 3: 393 ms per loop
It brought nothing in speed.

Here's the code
@cython.boundscheck(False)
ndim=1] w, np.ndarray[float, ndim=3] Ms):

@cython.boundscheck(False)
np.ndarray[float, ndim=1] w, np.ndarray[float, ndim=3] Ms):
cdef unsigned int mdim
cdef unsigned int i
for i from 0 <= i < nb_add:
R+=w[i]*Ms[i]
#for j in range(mdim):
#    for k in range(mdim):
#        R[j][k]+=w[i]*Ms[i][j][k]

return 0

So here's my question if someone has time to answer it: Did I try
anything possible? Should I move along and deal with this bottleneck?
Or is there something else I didn't think of?

Thanks for reading, keep up the great work!

-n
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion at scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20110130/cb5ff9a9/attachment.html>
```