[Numpy-discussion] odd performance of sum?

Thu Feb 10 16:26:16 EST 2011

On Thu, Feb 10, 2011 at 10:31 AM, Pauli Virtanen <pav at iki.fi> wrote:

> Thu, 10 Feb 2011 12:16:12 -0600, Robert Kern wrote:
> [clip]
> > One thing that might be worthwhile is to make
> > implementations of sum() and cumsum() that avoid the ufunc machinery and
> > do their iterations more quickly, at least for some common combinations
> > of dtype and contiguity.
>
> I wonder what is the balance between the iterator overhead and the time
> taken in the reduction inner loop. This should be straightforward to
> benchmark.
>
> Apparently, some overhead decreased with the new iterators, since current
> Numpy master outperforms 1.5.1 by a factor of 2 for this benchmark:
>
> In [8]: %timeit M.sum(1)     # Numpy 1.5.1
> 10 loops, best of 3: 85 ms per loop
>
> In [8]: %timeit M.sum(1)     # Numpy master
> 10 loops, best of 3: 49.5 ms per loop
>
> I don't think this is explainable by the new memory layout optimizations,
> since M is C-contiguous.
>
> Perhaps there would be room for more optimization, even within the ufunc
> framework?
>

I played around with this in einsum, where it's a bit easier to specialize
this case than in the ufunc machinery. What made the biggest difference was
using SSE prefetch instructions to prepare the cache in advance. Here are
the kinds of numbers I get, all from the current Numpy master:

In [7]: timeit M.sum(1)
10 loops, best of 3: 44.6 ms per loop

In [8]: timeit dot(M, o)
10 loops, best of 3: 36.8 ms per loop

In [9]: timeit einsum('ij->i', M)
10 loops, best of 3: 32.1 ms per loop
...
In [14]: timeit M.sum(1)
10 loops, best of 3: 41.5 ms per loop

In [15]: timeit dot(M, o)
10 loops, best of 3: 42.1 ms per loop

In [16]: timeit einsum('ij->i', M)
10 loops, best of 3: 30 ms per loop
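
For anyone who wants to rerun this comparison outside IPython, here is a minimal sketch. The thread does not state the shape or dtype of M, nor how o was built, so the 1000 x 1000 float64 array and the vector of ones below are assumptions chosen for illustration. Absolute timings will differ from the numbers above.

```python
import timeit

import numpy as np

# Assumed setup: the thread does not give M's shape or dtype.
rows, cols = 1000, 1000
rng = np.random.default_rng(0)
M = rng.random((rows, cols))   # C-contiguous float64 array
o = np.ones(cols)              # assumed: o is a vector of ones

# The three ways of summing along axis 1 that are timed above.
candidates = {
    "M.sum(1)": lambda: M.sum(1),
    "dot(M, o)": lambda: np.dot(M, o),
    "einsum('ij->i', M)": lambda: np.einsum('ij->i', M),
}

reference = M.sum(1)
results = {}
for label, fn in candidates.items():
    # Check that each variant computes the same row sums before timing it.
    assert np.allclose(fn(), reference)
    # Best of 3 repeats of 10 calls each, reported per call.
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    results[label] = best
    print(f"{label:>20}: {best * 1e3:.3f} ms per loop")
```

The allclose check is there because dot(M, o) only equals the row sum when o is all ones, so it guards against timing something that computes a different result.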

-Mark