[Numpy-discussion] Vectorizing code, for loops, and all that

Mon Oct 2 19:03:50 EDT 2006

Albert Strasheim wrote:

>In [571]: x1 = N.random.randn(2000,39)
>
>In [572]: y1 = N.random.randn(64,39)
>
>In [574]: %timeit z1=x1[...,N.newaxis,...]-y1 10 loops, best of 3: 703 ms
>per loop
>
>In [575]: z1.shape
>Out[575]: (2000, 64, 39)
>
>As far as I can figure, this operation is doing 2000*64*39 subtractions.
>Doing this straight up yields the following:
>
>In [576]: x2 = N.random.randn(2000,64,39)
>
>In [577]: y2 = N.random.randn(2000,64,39)
>
>In [578]: %timeit z2 = x2-y2
>10 loops, best of 3: 108 ms per loop
>
>Does anybody have any ideas on why this is so much faster? Hopefully I
>didn't mess up somewhere...
>  
>

I suspect I know why, although the difference seems rather large.  There 
is code optimization that is being taken advantage of in the second 
case.  If you have contiguous arrays (no broadcasting needed), then 1 
C-loop is used for the subtraction (your second case).

In the first case you are using broadcasting to generate the larger 
array.  This requires more complicated looping constructs under the 
covers which causes your overhead.  Bascially, you will have 64*39 1-d 
loops of 2000 elements each in the first example with a bit of 
calculation over-head to reset the pointers before each loop.

In the ufunc code, compare the ONE_UFUNCLOOP case with the 
NOBUFFER_UFUNCLOOP case.  If you want to be sure what is running 
un-comment the fprintf statements so you can tell.

I'm surprised the overhead of adjusting pointers is so high, but then 
again you are probably getting a lot of cache misses in the first case 
so there is more to it than that, the loops may run more slowly too.

-Travis