[Numpy-discussion] Slicing slower than matrix multiplication?

Francesc Alted faltet at pytables.org
Mon Dec 14 11:09:13 EST 2009


On Saturday 12 December 2009 12:59:16 Jasper van de Gronde wrote:
> Francesc Alted wrote:
> > ...
> > Yeah, I think taking slices here is taking quite a lot of time:
> >
> > In [58]: timeit E + Xi2[P/2,:]
> > 100000 loops, best of 3: 3.95 µs per loop
> >
> > In [59]: timeit E + Xi2[P/2]
> > 100000 loops, best of 3: 2.17 µs per loop
> >
> > I don't know why the additional ',:' in the slice is taking so much
> > time, but my guess is that passing & analyzing the second argument
> > (slice(None,None,None)) could be responsible for the slowdown (though
> > that alone seems like too much time).  Mmh, perhaps it would be worth
> > studying this more carefully so that an optimization could be done in
> > NumPy.
> 
> This is indeed interesting! And very nice that this actually works the
> way you'd expect it to. I guess I've just worked too long with Matlab :)
> 
> >> I think the lesson mostly should be that with so little data,
> >> benchmarking becomes a very difficult art.
> >
> > Well, I think it is not difficult, it is just that you are perhaps
> > benchmarking Python/NumPy machinery instead ;-)  I'm curious whether
> > Matlab can do slicing much faster than NumPy.  Jasper?
> 
> I had a look, these are the timings for Python for 60x20:
>    Dot product: 0.051165 (5.116467e-06 per iter)
>    Add a row: 0.092849 (9.284860e-06 per iter)
>    Add a column: 0.082523 (8.252348e-06 per iter)
> For Matlab 60x20:
>    Dot product: 0.029927 (2.992664e-006 per iter)
>    Add a row: 0.019664 (1.966444e-006 per iter)
>    Add a column: 0.008384 (8.384376e-007 per iter)
> For Python 600x200:
>    Dot product: 1.917235 (1.917235e-04 per iter)
>    Add a row: 0.113243 (1.132425e-05 per iter)
>    Add a column: 0.162740 (1.627397e-05 per iter)
> For Matlab 600x200:
>    Dot product: 1.282778 (1.282778e-004 per iter)
>    Add a row: 0.107252 (1.072525e-005 per iter)
>    Add a column: 0.021325 (2.132527e-006 per iter)
> 
> If I fit a line through these two data points (60 and 600 rows), I get
> the following equations:
>    Python, AR (add a row):    3.8e-5 * n + 0.091
>    Matlab, AC (add a column): 2.4e-5 * n + 0.0069
> This would suggest that Matlab performs the vector addition about 1.6
> times faster and has a 13 times smaller constant cost!

Things seem to be worse than 1.6x slower for numpy, as Matlab stores
arrays in column-major order while NumPy's default is row-major.  So, if
we want to compare apples with apples:

For Python 600x200:
   Add a row: 0.113243 (1.132425e-05 per iter)
For Matlab 600x200:
   Add a column: 0.021325 (2.132527e-006 per iter)

which makes numpy 5x slower than matlab.  Hmm, I definitely think that numpy 
could do better here :-/
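
A quick way to see the layout effect in isolation (a sketch under
assumptions: Jasper's script isn't shown, so I take "add a row" to mean
adding a contiguous row slice to a vector, and "add a column" adding a
strided column slice):

    import timeit
    import numpy as np

    A = np.random.rand(600, 200)   # C (row-major) order by default
    row = np.random.rand(200)
    col = np.random.rand(600)

    n = 100000
    # A[10, :] is contiguous in memory; A[:, 10] is strided.
    t_row = timeit.timeit(lambda: row + A[10, :], number=n)
    t_col = timeit.timeit(lambda: col + A[:, 10], number=n)
    print("add a row:    %.3e s per iter" % (t_row / n))
    print("add a column: %.3e s per iter" % (t_col / n))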

However, caveat emptor: when you do timings you normally put your code
snippets in loops, and after the first iteration the dataset (if small
enough, as in your examples above) lives in the CPU caches.  In real
applications this is *usually* not the case, because the data first has
to travel from main memory to the CPU.  That transfer is normally the
main bottleneck when doing BLAS-1 level operations (i.e. vector-vector).
This is to say that, in real-life calculations, your numpy code will run
almost as fast as Matlab's.  So my advice is: don't worry too much about
small-dataset speed in tight loops, and concentrate your optimization
efforts on making your *real* code faster.
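
To see the cache effect directly, you can compare the per-element cost of
the same vector addition on a small (cache-resident) operand and on one
much larger than any cache.  A sketch (the sizes are arbitrary
assumptions):

    import timeit
    import numpy as np

    for size in (1000, 10000000):
        a = np.random.rand(size)
        b = np.random.rand(size)
        reps = max(1, 10000000 // size)
        t = timeit.timeit(lambda: a + b, number=reps)
        # Once the operands no longer fit in cache, memory bandwidth
        # (not the CPU) dominates and the per-element cost rises.
        print("size=%8d: %.3f ns/element" % (size, t / reps / size * 1e9))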

-- 
Francesc Alted


