[Numpy-discussion] adding fused multiply and add to numpy
Charles R Harris
charlesr.harris at gmail.com
Wed Jan 8 17:09:58 EST 2014
On Wed, Jan 8, 2014 at 2:39 PM, Julian Taylor <jtaylor.debian at googlemail.com> wrote:
> AMD's Bulldozer and Intel's Haswell x86 CPUs now also support the
> fused multiply-add operation in hardware.
> This operation is interesting for numpy for two reasons:
> - Only one rounding step is involved in the two arithmetic operations,
> which reduces rounding error.
> - Two operations are done in one memory pass, which improves
> performance when operations are bound by memory bandwidth, as is
> very common in numpy.
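The single-rounding point can be illustrated even without hardware FMA by emulating a fused float32 multiply-add in float64. This is a sketch (not part of the proposed ufunc); the values are chosen so the exact product needs one more significand bit than float32 has:

```python
import numpy as np

# a*a = 1 + 2**-11 + 2**-24 exactly, which needs 25 significand bits;
# float32 has 24, so a two-step evaluation must round the product.
a = np.float32(1 + 2**-12)
c = np.float32(-1.0)

two_step = a * a + c  # product rounds to 1 + 2**-11 first; the 2**-24 term is lost

# emulate fma: compute exactly in float64, round once at the end
fused = np.float32(np.float64(a) * np.float64(a) + np.float64(c))

print(two_step)  # 2**-11
print(fused)     # 2**-11 + 2**-24: keeps the bit the two-step version drops
```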
> I have done some experiments using a custom ufunc:
> See the README.md for how to try it out. It requires a recent GCC
> compiler, at least 4.7 I think.
> It contains SSE2, AVX, FMA3 (AVX2), FMA4 and software-emulation
> variants. Edit the file to select which one to use. Note that if the
> CPU does not support the selected instruction it will simply crash.
> Only the latter three are real FMA operations; the SSE2 and AVX variants
> just perform two regular operations in one loop.
> My current machine only supports SSE2, so here are the timings for it:
> In : a = np.arange(500000.)
> In : b = np.arange(500000.)
> In : c = np.arange(500000.)
> In : %timeit npfma.fma(a, b, c)
> 100 loops, best of 3: 2.49 ms per loop
> In : def pure_numpy_fma(a,b,c):
> ....: return a * b + c
> In : %timeit pure_numpy_fma(a, b, c)
> 100 loops, best of 3: 7.36 ms per loop
> In : def pure_numpy_fma2(a,b,c):
> ....: tmp = a * b
> ....: tmp += c
> ....: return tmp
> In : %timeit pure_numpy_fma2(a, b, c)
> 100 loops, best of 3: 3.47 ms per loop
> As you can see, even without real hardware support it is about 30%
> faster than in-place unblocked numpy due to better use of memory
> bandwidth. It is even more than two times faster than unoptimized numpy.
> If you have a machine capable of fma instructions, give it a spin to see
> if you get similar or better results. Please verify the assembly
> (objdump -d fma-<suffix>.o) to check that the compiler properly used the
> machine fma.
> An issue is software emulation of real fma. This can be enabled in the
> test ufunc with npfma.set_type("libc").
> Unfortunately this is incredibly slow, about a factor of 300 on my
> machine without hardware fma.
> This means we either have a function that is fast on some platforms and
> slow on others but always gives the same result, or we have a fast
> function that gives better results on some platforms.
> Given that the latter is no worse than what numpy currently provides,
> I favor it.
> Any opinions on whether this should go into numpy or maybe stay a third
> party ufunc?
Multiply and add is a standard function that I think would be good to have
in numpy. Not only does it save on memory accesses, it saves on temporary
arrays.
Another function that could be useful is an |a|**2 function, abs2 perhaps.
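For complex input, an abs2 ufunc could skip the square root that np.abs computes only to have it squared away again. A sketch of the equivalent computation with existing ufuncs (abs2 itself is hypothetical, not an existing numpy function):

```python
import numpy as np

z = np.array([3 + 4j, 1 - 2j])

via_abs = np.abs(z) ** 2      # sqrt followed by squaring: extra work and rounding
direct = (z * z.conj()).real  # |z|**2 == z * conj(z); no square root involved

print(direct)  # [25.  5.]
```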