Good to know it is not only on my PC.

I have done a fair bit of work trying to find a more efficient sum.

The only faster option I found was PyTorch (although, thinking about it now, that may have been because it was using MKL; I don’t remember).

MKL is faster, but I use OpenBLAS.

The Scipp library is parallelized, and its performance becomes similar to `dotsum` for large arrays, but it is slower than both numpy and dotsum for sizes below roughly ~200k elements.

Apart from these I ran out of options, so I simply implemented my own sum, which uses either `np.sum` or `dotsum` depending on which is faster.
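For reference, a minimal sketch of what I mean (the names `dotsum` and `fast_sum` and the ~200k crossover are illustrative; the actual threshold is machine-dependent and should be measured on each setup):

```python
import numpy as np

def dotsum(a):
    # Sum via a BLAS dot product with a ones vector. This is one common
    # way to build such a "dotsum"; the exact definition used in the
    # timings in this thread is not shown, so treat this as a sketch.
    a = np.asarray(a).ravel()
    return a.dot(np.ones(a.size, dtype=a.dtype))

def fast_sum(a, threshold=200_000):
    # Hybrid sum: np.sum for small arrays, BLAS-backed dotsum for large
    # ones. The threshold (~200k elements here) is a hypothetical
    # placeholder for the measured crossover point.
    a = np.asarray(a)
    return dotsum(a) if a.size >= threshold else np.sum(a)
```

Note that `dotsum` allocates a temporary ones vector of the same size as the input, so it trades memory for the BLAS speed-up.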

This chart shows the point where dotsum becomes faster than np.sum:
https://gcdnb.pbrd.co/images/j8n3EsRz5g5v.png?o=1

I am not sure how needed (and for how many people) the improvement is, but I found several Stack Overflow posts about this while I was looking into it. It definitely matters to me, though.

Theoretically, if implemented with the same optimisations, sum should be ~2x faster than dotsum: a dot product performs a multiplication and an addition per element, whereas a sum performs only the addition. Or am I missing something?

Regards,
DG


On 16 Feb 2024, at 04:54, Homeier, Derek <dhomeie@gwdg.de> wrote:



On 16 Feb 2024, at 2:48 am, Marten van Kerkwijk <mhvk@astro.utoronto.ca> wrote:

In [45]: %timeit np.add.reduce(a, axis=None)
42.8 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [43]: %timeit dotsum(a)
26.1 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

But theoretically, sum should be faster than a dot product by a fair bit.

Isn’t parallelisation implemented for it?

I cannot reproduce that:

In [3]: %timeit np.add.reduce(a, axis=None)
19.7 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

In [4]: %timeit dotsum(a)
47.2 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

But almost certainly it is indeed due to optimizations, since .dot uses
BLAS which is highly optimized (at least on some platforms, clearly
better on yours than on mine!).

I thought .sum() was optimized too, but perhaps less so?


I can confirm at least that it does not seem to use multithreading – with the conda-installed numpy+BLAS
I almost exactly reproduce your numbers, whereas linked against my own OpenBLAS build I get:

In [3]: %timeit np.add.reduce(a, axis=None)
19 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

# OMP_NUM_THREADS=1
In [4]: %timeit dots(a)
20.5 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# OMP_NUM_THREADS=8
In [4]: %timeit dots(a)
9.84 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

add.reduce shows no difference between the two and always remains at <= 100 % CPU usage.
dotsum scales still better with larger matrices, e.g. ~4x for 1000x1000.

Cheers,
Derek
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-leave@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: dom.grigonis@gmail.com