[Numpy-discussion] Help in speeding up accumulation in a matrix

Sun Jan 30 11:12:50 EST 2011

Am 29.1.2011 um 22:01 schrieb Nicolas SCHEFFER:

> Hi all,
> 
> First email to the list for me, I just want to say how grateful I am
> to have python+numpy+ipython etc... for my day to day needs. Great
> combination of software.
> 
> Anyway, I've been having this bottleneck in one my algorithms that has
> been bugging me for quite a while.
> The objective is to speed this part up. I've been doing tons of
> optimization and parallel processing around that piece of code to get
> a decent run time.
> 
> The problem is easy. You want to accumulate in a matrix, a weighted
> sum of other matrices. Let's call this function scale_and_add:
> def scale_and_add_re(R,w,Ms):
>    (nb_add,mdim,j)=np.shape(Ms)
>    for i in range(nb_add):
>        R+=w[i]*Ms[i]
>    return R
> This 'for' loop bugs me since I know this will slow things down.
> 
> But the dimension of these things are so large that any attempt to
> vectorize this is slower and takes too much memory.
> I typically work with 1000 weights and matrices, matrices of dimension
> of several hundred.
> 
> My current config is:
> In [2]: %timeit scale_and_add_re(R,w,Ms)
> 1 loops, best of 3: 392 ms per loop
> 
> In [3]: w.shape
> Out[3]: (2000,)
> 
> In [4]: Ms.shape
> Out[4]: (2000, 200, 200)
> I'd like to be able to double these dimensions.
> 
> For instance I could use broadcasting by using a dot product
> %timeit dot(Ms.T,w)
> 1 loops, best of 3: 1.77 s per loop
> But this is i) slower ii) takes too much memory
> (btw, I'd really need an inplace dot-product in numpy to avoid the
> copy in memory, like the blas call in scipy.linalg. But that's for an
> other thread...)

If you use a different memory layout for your data, you can improve your performance with dot:

MsT = Ms.T.copy()
%timeit np.dot(M,w)

10 loops, best of 3: 107 ms per loop

for comparison:
In [32]: %timeit scale_and_add_re(R,w,Ms)
1 loops, best of 3: 245 ms per loop

I don't think you can do much better than this. The above value gives a memory bandwidth for accessing Ms of 6 GB/s, I think the hardware limit for my system is about 10 Gb/s. 

Gregor