[Numpy-discussion] odd performance of sum?

Thu Feb 10 13:16:12 EST 2011

On Thu, Feb 10, 2011 at 11:53, eat <e.antero.tammi at gmail.com> wrote:
> Thanks Chuck,
>
> for replying. But don't you still feel very odd that dot outperforms sum in
> your machine? Just to get it simply; why sum can't outperform dot? Whatever
> architecture (computer, cache) you have, it don't make any sense at all that
> when performing significantly less instructions, you'll reach to spend more
> time ;-).

These days, the determining factor is less often instruction count
than memory latency, and the optimized BLAS implementations of dot()
heavily optimize the memory access patterns. Additionally, the number
of instructions in your dot() probably isn't much larger than in the
sum(). The sum() is pretty dumb and just does a linear accumulation
using the ufunc reduce mechanism, so for an m-by-n array that is
(m*n-1) ADDs plus quite a few instructions for traversing the array in
a generic manner. With fused
multiply-adds, being able to assume contiguous data and ignore the
numpy iterator overhead, and applying divide-and-conquer kernels to
arrange sums, the optimized dot() implementations could have a
comparable instruction count.
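
To try this on your own machine, here is a minimal sketch. The helper
name sum_via_dot, the array size and the repeat count are my own
choices for illustration, not necessarily the benchmark used earlier in
the thread. It computes the same total both ways and times each; which
one wins will depend on your BLAS, your NumPy version and your cache
sizes, so no timings are claimed here.

```python
import timeit

import numpy as np


def sum_via_dot(a):
    """Sum every element of a 2-D float array with two dot() calls.

    ones(m) . a collapses the rows, and the result . ones(n) collapses
    the columns, so the final scalar is the sum of all elements.
    """
    m, n = a.shape
    return np.dot(np.dot(np.ones(m), a), np.ones(n))


rng = np.random.default_rng(0)
a = rng.random((500, 500))  # C-contiguous float64, the friendly case for BLAS

total_sum = a.sum()          # ufunc reduce path
total_dot = sum_via_dot(a)   # BLAS path (if NumPy is linked against one)

# Time both; the absolute numbers are machine-dependent.
t_sum = timeit.timeit(lambda: a.sum(), number=10)
t_dot = timeit.timeit(lambda: sum_via_dot(a), number=10)

print("sum():", total_sum, "in", t_sum, "s")
print("dot():", total_dot, "in", t_dot, "s")
```

The two totals should agree to within floating-point rounding; they
need not be bit-identical, because the additions are performed in a
different order.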

If you were willing to spend that amount of developer time and code
complexity to make platform-specific backends to sum(), you could make
it go really fast, too. Typically, though, sum() is not important enough
to justify that effort. One thing that might be worthwhile is to make
implementations of sum() and cumsum() that avoid the ufunc machinery
and do their iterations more quickly, at least for some common
combinations of dtype and contiguity.
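
As a rough Python-level illustration of that last idea (a real
implementation would live in C inside numpy), one could dispatch on
dtype and contiguity and fall back to the generic reduce otherwise.
The function fast_sum below is hypothetical, not an existing numpy API,
and using dot() with a vector of ones as the "fast path" is only an
assumption for the sake of the sketch.

```python
import numpy as np


def fast_sum(a):
    """Hypothetical sum() with a special case for one common layout.

    Fast path: C-contiguous float64 data is viewed as a flat vector and
    reduced with dot() against ones. Anything else falls back to the
    generic ufunc reduce, which is what sum() uses.
    """
    a = np.asarray(a)
    if a.dtype == np.float64 and a.flags.c_contiguous:
        flat = a.reshape(-1)  # a view, no copy, because a is contiguous
        return np.dot(flat, np.ones(flat.size))
    # Generic path: handles any dtype and any strides.
    return np.add.reduce(a, axis=None)


x = np.arange(12, dtype=np.float64).reshape(3, 4)
print(fast_sum(x))          # takes the contiguous float64 path
print(fast_sum(x[:, ::2]))  # strided view, takes the generic path
print(fast_sum(np.arange(5)))  # integer dtype, takes the generic path
```

Allocating the ones vector on every call is wasteful; it is only there
to keep the sketch short.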

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco