# [Numpy-discussion] Help in speeding up accumulation in a matrix

Nicolas SCHEFFER scheffer.nicolas at gmail.com
Sun Jan 30 14:37:07 EST 2011

```
Hi all,

Thanks for all the answers; they give me a lot of new ideas and point me
to functions I didn't know about.
@Charles:
The reshape way is a great idea! It gives a nice alternative to the for
loop, letting the code be vectorized. I tested it and got:
1 loops, best of 3: 1.35 s per loop
with Ms already reshaped (Msr). So better than dot, but still slower than
the naive loop.
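
# A sketch of the reshape idea above, with made-up small shapes (w of
# length n, Ms a stack of n d-by-d matrices): flattening each matrix to a
# row turns the weighted sum into a single vector-matrix product.
import numpy as np

n, d = 100, 20
rng = np.random.default_rng(0)
w = rng.random(n)
Ms = rng.random((n, d, d))

Msr = Ms.reshape(n, d * d)          # each matrix flattened to one row
R = np.dot(w, Msr).reshape(d, d)    # weighted sum, reshaped back to (d, d)

# same result as the naive accumulation loop
R_loop = np.zeros((d, d))
for i in range(n):
    R_loop += w[i] * Ms[i]
assert np.allclose(R, R_loop)
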

@eat:
Yes, I need these matrices in memory. I'm doing this operation tens of
thousands of times with different weights (it's an ML algorithm where w is
given by each input example). How would not keeping them in memory improve
the speed?

@josef
So triu_indices is how you take care of symmetric matrices in NumPy?
I read on the mailing list that some implementations of these slicing
operations might be slow, though. I haven't had a chance to test that yet,
but I certainly will.
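
# A minimal sketch of the triu_indices idea (toy example, not from the
# thread): store only the upper triangle of a symmetric matrix and
# reconstruct the full matrix when needed.
import numpy as np

d = 4
A = np.arange(d * d, dtype=float).reshape(d, d)
S = A + A.T                         # a symmetric test matrix

iu = np.triu_indices(d)             # indices of the upper triangle
upper = S[iu]                       # d*(d+1)/2 values instead of d*d

# rebuild the full symmetric matrix from the packed triangle
R = np.zeros((d, d))
R[iu] = upper
R = R + R.T - np.diag(np.diag(R))   # mirror, fixing the double-counted diagonal
assert np.allclose(R, S)
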

@gregor
Ok, I feel stupid ;) Thanks Gregor. I really thought I had checked the
compatibility of the array.flags before the dot product, but it seems I
hadn't.
This is the solution: make sure the array is C_contiguous before doing the dot.

That gives a 4x improvement over the naive implementation, which runs at:
1 loops, best of 3: 392 ms per loop

In [18]: %timeit dot(Ms.T,w)
1 loops, best of 3: 1.81 s per loop

In [19]: %timeit dot(MsT,w)
10 loops, best of 3: 86.2 ms per loop
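
# A sketch of the fix being timed above (smaller, made-up shapes): Ms.T is
# only a strided F-ordered view, so copying it into C-contiguous memory
# once lets the BLAS-backed dot run at full speed.
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 50
w = rng.random(n)
Ms = rng.random((n, d, d))

assert not Ms.T.flags['C_CONTIGUOUS']  # the transpose is just a view
MsT = np.ascontiguousarray(Ms.T)       # one-time copy into C order
R = np.dot(MsT, w)                     # same values as np.dot(Ms.T, w), faster
assert np.allclose(R, np.dot(Ms.T, w))
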

That's a hell of a boost! I don't think we could do better than that.

However, it's doing almost twice the work it needs to (the matrices are
symmetric), so it could probably be made faster still. Maybe the underlying
BLAS code already takes care of that without me knowing. For now, I'm fine
with a 4x improvement!
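
# One way the symmetry could be exploited, sketched with assumed shapes
# (not code from the thread): pack each symmetric matrix into its upper
# triangle so the dot product touches roughly half the data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 30
w = rng.random(n)
A = rng.random((n, d, d))
Ms = A + A.transpose(0, 2, 1)           # a stack of symmetric matrices

iu = np.triu_indices(d)
packed = np.ascontiguousarray(Ms[:, iu[0], iu[1]])  # shape (n, d*(d+1)//2)
Rp = np.dot(w, packed)                  # weighted sum of the packed triangles

# unpack the accumulated triangle back into a full (d, d) matrix
R = np.zeros((d, d))
R[iu] = Rp
R = R + R.T - np.diag(np.diag(R))
assert np.allclose(R, np.tensordot(w, Ms, axes=1))
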

Thanks much for the help!

-n

On Sun, Jan 30, 2011 at 8:12 AM, Gregor Thalhammer
<gregor.thalhammer at gmail.com> wrote:
>
> Am 29.1.2011 um 22:01 schrieb Nicolas SCHEFFER:
>
>> Hi all,
>>
>> First email to the list for me, I just want to say how grateful I am
>> to have python+numpy+ipython etc... for my day to day needs. Great
>> combination of software.
>>
>> Anyway, I've been having this bottleneck in one my algorithms that has
>> been bugging me for quite a while.
>> The objective is to speed this part up. I've been doing tons of
>> optimization and parallel processing around that piece of code to get
>> a decent run time.
>>
>> The problem is easy: you want to accumulate, in a matrix, a weighted
>> sum of other matrices. Let's call this function scale_and_add:
>>    for i in range(len(w)):
>>        R += w[i] * Ms[i]
>>    return R
>> This 'for' loop bugs me since I know this will slow things down.
>>
>> But the dimension of these things are so large that any attempt to
>> vectorize this is slower and takes too much memory.
>> I typically work with 1000 weights and matrices, matrices of dimension
>> of several hundred.
>>
>> My current config is:
>> 1 loops, best of 3: 392 ms per loop
>>
>> In [3]: w.shape
>> Out[3]: (2000,)
>>
>> In [4]: Ms.shape
>> Out[4]: (2000, 200, 200)
>> I'd like to be able to double these dimensions.
>>
>> For instance I could use broadcasting by using a dot product
>> %timeit dot(Ms.T,w)
>> 1 loops, best of 3: 1.77 s per loop
>> But this is (i) slower and (ii) takes too much memory.
>> (btw, I'd really need an inplace dot-product in numpy to avoid the
>> copy in memory, like the blas call in scipy.linalg. But that's for an
>
> If you use a different memory layout for your data, you can improve your performance with dot:
>
> MsT = Ms.T.copy()
> %timeit np.dot(MsT,w)
>
> 10 loops, best of 3: 107 ms per loop
>
> for comparison, without the copy:
> 1 loops, best of 3: 245 ms per loop
>
> I don't think you can do much better than this. The above timing corresponds to a memory bandwidth of 6 GB/s for accessing Ms; I think the hardware limit of my system is about 10 GB/s.
>
> Gregor
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>

```