One more thing to mention on this topic.
From a certain size, the dot product becomes faster than sum (due to parallelisation, I guess?).
E.g.:

    def dotsum(arr):
        a = arr.reshape(1000, 100)
        return a.dot(np.ones(100)).sum()

    a = np.ones(100000)
    In [45]: %timeit np.add.reduce(a, axis=None)
    42.8 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    In [43]: %timeit dotsum(a)
    26.1 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
But theoretically, sum should be faster than the dot product by a fair bit.
Isn’t parallelisation implemented for it?
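For anyone who wants to try this locally, here is a minimal self-contained version of the benchmark above, using the stdlib `timeit` module instead of the IPython magic (the iteration count is illustrative; absolute timings will depend on which BLAS your NumPy links against):

```python
import timeit

import numpy as np

def dotsum(arr):
    # Reshape to 2-D and reduce via a BLAS matrix-vector product,
    # then sum the 1000 partial results.
    a = arr.reshape(1000, 100)
    return a.dot(np.ones(100)).sum()

a = np.ones(100000)

# Sanity check: both reductions compute the same value.
assert np.isclose(dotsum(a), np.add.reduce(a, axis=None))

# Time each approach over the same number of iterations.
t_sum = timeit.timeit(lambda: np.add.reduce(a, axis=None), number=1000)
t_dot = timeit.timeit(lambda: dotsum(a), number=1000)
print(f"np.add.reduce: {t_sum * 1e6 / 1000:.1f} us/loop")
print(f"dotsum:        {t_dot * 1e6 / 1000:.1f} us/loop")
```

Note that `dotsum` as written only works for arrays whose size is exactly 100000 (it hard-codes the reshape to 1000×100), so it trades generality for the BLAS-backed reduction.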
I cannot reproduce that:

    In [3]: %timeit np.add.reduce(a, axis=None)
    19.7 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

    In [4]: %timeit dotsum(a)
    47.2 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

But almost certainly it is indeed due to optimizations, since .dot uses BLAS, which is highly optimized (at least on some platforms; clearly better on yours than on mine!). I thought .sum() was optimized too, but perhaps less so? It may be good to raise a quick issue about this!

Thanks, Marten