[Numpy-discussion] Ufunc memory access optimization

Tue Jun 15 10:10:24 EDT 2010

Correct me if I'm wrong, but this code still doesn't seem to flatten
arrays as far as possible before iterating over them. The array you
get out of np.zeros((100,100)) can be iterated over as an array of
shape (10000,), which should yield very substantial speedups. Since
most arrays one operates on are like this, there's potentially a large
speedup here. (On the other hand, if this optimization is being done,
then these tests are somewhat deceptive.)

Separately, it seems to me there's still some question about
how to optimize execution order when the ufunc is dealing with two or
more arrays with different memory layouts. In such a case, which one
do you reorder in favour of? Is it acceptable to return
freshly-allocated arrays that are not C-contiguous?
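
As a sketch of what the output-layout question looks like from the user
side: current NumPy releases accept an order keyword on ufunc calls
(this is an assumption about the reader's NumPy version and is not part
of the branch being discussed). The variable names are illustrative.

```python
import numpy as np

x = np.zeros((100, 100))
xt = x.T  # Fortran-contiguous view of the same data

# Two operands with different memory layouts: force the layout of the
# freshly allocated result either way.
mixed_c = np.add(xt, x, order="C")
mixed_f = np.add(xt, x, order="F")

# With no order given, NumPy picks a layout based on the inputs; print
# the flags rather than assume which one it chooses here.
mixed_default = np.add(xt, x)
print(mixed_default.flags.c_contiguous, mixed_default.flags.f_contiguous)
```

Both forced results hold the same values; only the strides differ, which
is exactly the trade-off in question when the inputs disagree.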

Anne

On 15 June 2010 07:37, Pauli Virtanen <pav at iki.fi> wrote:
> Fri, 2010-06-11 at 10:52 +0200, Hans Meine wrote:
>> On Friday 11 June 2010 10:38:28 Pauli Virtanen wrote:
> [clip]
>> > I think there was some code ready to implement this shuffling. I'll try
>> > to dig it out and implement the shuffling.
>>
>> That would be great!
>>
>> Ullrich Köthe has implemented this for our VIGRA/numpy bindings:
>>   http://tinyurl.com/fast-ufunc
>> At the bottom you can see that he basically wraps all numpy.ufuncs he can find
>> in the numpy top-level namespace automatically.
>
> Ok, here's the branch:
>
>        http://github.com/pv/numpy-work/compare/master...feature;ufunc-memory-access-speedup
>
> Some samples (the times in parentheses are the times without the
> optimization):
>
>        x = np.zeros([100,100])
>        %timeit x + x
>        10000 loops, best of 3: 106 us (99.1 us) per loop
>        %timeit x.T + x.T
>        10000 loops, best of 3: 114 us (164 us) per loop
>        %timeit x.T + x
>        10000 loops, best of 3: 176 us (171 us) per loop
>
>        x = np.zeros([100,5,5])
>        %timeit x.T + x.T
>        10000 loops, best of 3: 47.7 us (61 us) per loop
>
>        x = np.zeros([100,5,100]).transpose(2,0,1)
>        %timeit np.cos(x)
>        100 loops, best of 3: 3.77 ms (9 ms) per loop
>
> As expected, some improvement can be seen. There also appears to be
> an additional 5 us (~ 700 inner loop operations, it seems) of overhead
> coming from somewhere; perhaps this can still be reduced.
>
> --
> Pauli Virtanen