[Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)
Pauli Virtanen
pav+sp at iki.fi
Wed Jul 8 18:55:05 EDT 2009
On 2009-07-08, Charles R Harris <charlesr.harris at gmail.com> wrote:
[clip]
> How do the benchmarks compare with just making contiguous copies? Which is
> blocking of sort, I suppose.
I think that's slower than just walking over the discontiguous
array:
1) The new code: (on the Athlon machine)
$ ./bench-reduce.py ../../../build/lib.linux-i686-2.6 8000,260 0,1 0
<8000> {260} [ <2080> {8} ]
-15 vs. 0.876923 : 1 2
new 8000,260 0,1 0 27.7 29.5 28.9 30.2 30.3
2) Old code, with "x.transpose(1,0).copy().sum(axis=-1)":
$ ./bench-reduce.py '' -o 8000,260 0,1 0
old 8000,260 0,1 0 5.03 5.13 5.11 5.07 4.87
3) Old code, with "x.sum(axis=0)"
$ ./bench-reduce.py '' 8000,260 0,1 0
old 8000,260 0,1 0 18.6 17.3 16.6 18.9 17.2
The last five numbers on each row are repeats per second, over five
repeats. (Above, the trailing dimension 260 was chosen because it
maximizes the speed of the old code: 260 is almost twice as fast
as 256... This seems to be a cache padding effect.)
I think this is expected: the main bottleneck is probably memory
bandwidth, so when you make a copy and then reduce it, you
round-trip the array twice through the CPU cache. This is likely
to be more expensive than a single trip, even if that single trip
is discontiguous.
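A back-of-the-envelope memory-traffic count makes the round-trip argument concrete (this assumes, simplistically, that every element crosses the memory bus once per read or write pass; real traffic depends on cache-line reuse, but the ratio is indicative):

```python
# Rough traffic estimate for reducing an (8000, 260) float64 array.
n_elements = 8000 * 260
bytes_per = 8  # float64

# Direct reduction: one read pass over the data.
direct_traffic = n_elements * bytes_per

# Copy-then-reduce: read the source, write the copy, read the copy again.
copy_traffic = 3 * n_elements * bytes_per

print(direct_traffic / 1e6, copy_traffic / 1e6)  # MB moved: 16.64 vs 49.92
```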
Also, I'm not sure whether copying a discontiguous array in Numpy
does any cache optimization, or whether it just walks the
array in virtual C order, copying the data. If the latter, it
probably hits the same cache problems as the non-optimized
reduction operation.
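The strided access pattern behind that copy is easy to inspect (a sketch; the flags and strides attributes are standard ndarray properties):

```python
import numpy as np

x = np.zeros((8000, 260))
xt = x.transpose(1, 0)

# The transpose is a view with swapped strides, not C-contiguous memory.
print(x.flags['C_CONTIGUOUS'])   # True
print(xt.flags['C_CONTIGUOUS'])  # False
print(x.strides, xt.strides)     # (2080, 8) vs (8, 2080)

# .copy() produces a C-contiguous result, so it must gather source
# elements that sit 2080 bytes apart -- a strided walk over memory.
xc = xt.copy()
print(xc.flags['C_CONTIGUOUS'])  # True
```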
--
Pauli Virtanen