[Numpy-discussion] Optimizing reduction loops (sum(), prod(), et al.)
Pauli Virtanen
pav+sp at iki.fi
Wed Jul 8 18:55:05 EDT 2009
On 2009-07-08, Charles R Harris <charlesr.harris at gmail.com> wrote:
[clip]
> How do the benchmarks compare with just making contiguous copies? Which is
> blocking of sort, I suppose.
I think that's slower than just walking over the discontiguous
array:
1) The new code: (on the Athlon machine)
$ ./bench-reduce.py ../../../build/lib.linux-i686-2.6 8000,260 0,1 0
<8000> {260} [ <2080> {8} ]
-15 vs. 0.876923 : 1 2
new 8000,260 0,1 0 27.7 29.5 28.9 30.2 30.3
2) Old code, with "x.transpose(1,0).copy().sum(axis=-1)":
$ ./bench-reduce.py '' -o 8000,260 0,1 0
old 8000,260 0,1 0 5.03 5.13 5.11 5.07 4.87
3) Old code, with "x.sum(axis=0)"
$ ./bench-reduce.py '' 8000,260 0,1 0
old 8000,260 0,1 0 18.6 17.3 16.6 18.9 17.2
The last five numbers on each row are repeats per second, over five
repeats. (Above, the trailing dimension 260 was chosen because it
maximizes the speed of the old code: 260 is almost twice as fast
as 256... This seems to be a cache padding effect.)
I think this is expected: the main bottleneck is probably memory
bandwidth, so when you make a copy and then reduce it, you
round-trip the array twice through the CPU cache. This is likely
to be more expensive than a single trip, even if that single trip
is discontiguous.
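A back-of-the-envelope memory-traffic count makes the round-trip argument concrete (this assumes, simplistically, that every element crosses the memory bus once per read or write pass; real traffic depends on cache-line reuse, but the ratio is indicative):

```python
# Rough traffic estimate for reducing an (8000, 260) float64 array.
n_elements = 8000 * 260
bytes_per = 8  # float64

# Direct reduction: one read pass over the data.
direct_traffic = n_elements * bytes_per

# Copy-then-reduce: read the source, write the copy, read the copy again.
copy_traffic = 3 * n_elements * bytes_per

print(direct_traffic / 1e6, copy_traffic / 1e6)  # MB moved: 16.64 vs 49.92
```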
Also, I'm not sure whether copying a discontiguous array in Numpy
does any cache optimization, or whether it just walks the
array in virtual C order, copying the data. If the latter, it
probably hits the same cache problems as the non-optimized
reduction operation.
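The strided access pattern behind that copy is easy to inspect (a sketch; the flags and strides attributes are standard ndarray properties):

```python
import numpy as np

x = np.zeros((8000, 260))
xt = x.transpose(1, 0)

# The transpose is a view with swapped strides, not C-contiguous memory.
print(x.flags['C_CONTIGUOUS'])   # True
print(xt.flags['C_CONTIGUOUS'])  # False
print(x.strides, xt.strides)     # (2080, 8) vs (8, 2080)

# .copy() produces a C-contiguous result, so it must gather source
# elements that sit 2080 bytes apart -- a strided walk over memory.
xc = xt.copy()
print(xc.flags['C_CONTIGUOUS'])  # True
```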
--
Pauli Virtanen