adding fused multiply and add to numpy
Hi, Since AMDs bulldozer and Intels Haswell x86 cpus now also support the fusedmultiplyandadd operation in hardware.
http://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation
This operation is interesting for numpy for two reasons:  Only one rounding is involved in two arithmetic operations, this is good reducing rounding errors.  Two operations are done in one memory pass, so it improves the performance if ones operations are bound by the memory bandwidth which is very common in numpy.
I have done some experiments using a custom ufunc: https://github.com/juliantaylor/npufuncs
See the README.md on how to try it out. It requires a recent GCC compiler, at least 4.7 I think.
It contains SSE2, AVX, FMA3 (AVX2), FMA4 and software emulation variants. Edit the file to select which one to use. Note if the cpu does not support the instruction it will just crash. Only the latter three are real FMA operations, the SSE2 and AVX variants just perform two regular operations in one loop.
My current machine only supports SSE2, so here are the timings for it:
In [25]: a = np.arange(500000.) In [26]: b = np.arange(500000.) In [27]: c = np.arange(500000.)
In [28]: %timeit npfma.fma(a, b, c) 100 loops, best of 3: 2.49 ms per loop
In [30]: def pure_numpy_fma(a,b,c): ....: return a * b + c
In [31]: %timeit pure_numpy_fma(a, b, c) 100 loops, best of 3: 7.36 ms per loop
In [32]: def pure_numpy_fma2(a,b,c): ....: tmp = a *b ....: tmp += c ....: return tmp
In [33]: %timeit pure_numpy_fma2(a, b, c) 100 loops, best of 3: 3.47 ms per loop
As you can see even without real hardware support it is about 30% faster than inplace unblocked numpy due better use of memory bandwidth. Its even more than two times faster than unoptimized numpy.
If you have a machine capable of fma instructions give it a spin to see if you get similar or better results. Please verify the assembly (objdump d fma<suffix>.o) to check if the compiler properly used the machine fma.
An issue is software emulation of real fma. This can be enabled in the test ufunc with npfma.set_type("libc"). This is unfortunately incredibly slow about a factor 300 on my machine without hardware fma. This means we either have a function that is fast on some platforms and slow on others but always gives the same result or we have a fast function that gives better results on some platforms. Given that we are not worth that what numpy currently provides I favor the latter.
Any opinions on whether this should go into numpy or maybe stay a third party ufunc?
Concerning the interface one should probably add several variants mirroring the FMA3 instruction set: http://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation
additionally there is fmaddsub (a0 * b0 + c0, a1 *b1  c1) which can be used for complex numbers, but they probably don't need an explicit numpy interface.
On Wed, Jan 8, 2014 at 2:39 PM, Julian Taylor <jtaylor.debian@googlemail.com
wrote:
Hi, Since AMDs bulldozer and Intels Haswell x86 cpus now also support the fusedmultiplyandadd operation in hardware.
http://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation
This operation is interesting for numpy for two reasons:
 Only one rounding is involved in two arithmetic operations, this is
good reducing rounding errors.
 Two operations are done in one memory pass, so it improves the
performance if ones operations are bound by the memory bandwidth which is very common in numpy.
I have done some experiments using a custom ufunc: https://github.com/juliantaylor/npufuncs
See the README.md on how to try it out. It requires a recent GCC compiler, at least 4.7 I think.
It contains SSE2, AVX, FMA3 (AVX2), FMA4 and software emulation variants. Edit the file to select which one to use. Note if the cpu does not support the instruction it will just crash. Only the latter three are real FMA operations, the SSE2 and AVX variants just perform two regular operations in one loop.
My current machine only supports SSE2, so here are the timings for it:
In [25]: a = np.arange(500000.) In [26]: b = np.arange(500000.) In [27]: c = np.arange(500000.)
In [28]: %timeit npfma.fma(a, b, c) 100 loops, best of 3: 2.49 ms per loop
In [30]: def pure_numpy_fma(a,b,c): ....: return a * b + c
In [31]: %timeit pure_numpy_fma(a, b, c) 100 loops, best of 3: 7.36 ms per loop
In [32]: def pure_numpy_fma2(a,b,c): ....: tmp = a *b ....: tmp += c ....: return tmp
In [33]: %timeit pure_numpy_fma2(a, b, c) 100 loops, best of 3: 3.47 ms per loop
As you can see even without real hardware support it is about 30% faster than inplace unblocked numpy due better use of memory bandwidth. Its even more than two times faster than unoptimized numpy.
If you have a machine capable of fma instructions give it a spin to see if you get similar or better results. Please verify the assembly (objdump d fma<suffix>.o) to check if the compiler properly used the machine fma.
An issue is software emulation of real fma. This can be enabled in the test ufunc with npfma.set_type("libc"). This is unfortunately incredibly slow about a factor 300 on my machine without hardware fma. This means we either have a function that is fast on some platforms and slow on others but always gives the same result or we have a fast function that gives better results on some platforms. Given that we are not worth that what numpy currently provides I favor the latter.
Any opinions on whether this should go into numpy or maybe stay a third party ufunc?
Multiply and add is a standard function that I think would be good to have in numpy. Not only does it save on memory accesses, it saves on temporary arrays.
Another function that could be useful is a a**2 function, abs2 perhaps.
Chuck
Charles R Harris wrote:
On Wed, Jan 8, 2014 at 2:39 PM, Julian Taylor <jtaylor.debian@googlemail.com
wrote:
...
Another function that could be useful is a a**2 function, abs2 perhaps.
Chuck
I use mag_sqr all the time. It should be much faster for complex, if computed via:
x.real**2 + x.imag**2
avoiding the sqrt of abs.
On 08/01/14 21:39, Julian Taylor wrote:
An issue is software emulation of real fma. This can be enabled in the test ufunc with npfma.set_type("libc"). This is unfortunately incredibly slow about a factor 300 on my machine without hardware fma. This means we either have a function that is fast on some platforms and slow on others but always gives the same result or we have a fast function that gives better results on some platforms. Given that we are not worth that what numpy currently provides I favor the latter.
Any opinions on whether this should go into numpy or maybe stay a third party ufunc?
My preference would be to initially add an "madd" intrinsic. This can be supported on all platforms and can be documented to permit the use of FMA where available.
A 'true' FMA intrinsic function should only be provided when hardware FMA support is available. Many of the more interesting applications of FMA depend on there only being a single rounding step and as such "FMA" should probably mean "a*b + c with only a single rounding".
Regards, Freddie.
Hi,
It happen frequently that NumPy isn't compiled with all instruction that is available where it run. For example in distro. So if the decision is made to use the fast version when we don't use the newer instruction, the user need a way to know that. So the library need a function/attribute to tell that.
How hard would it be to provide the choise to the user? We could provide 2 functions like: fma_fast() fma_prec() (for precision)? Or this could be a parameter or a user configuration option like for the overflow/underflow error.
Fred
On Thu, Jan 9, 2014 at 9:43 AM, Freddie Witherden freddie@witherden.org wrote:
On 08/01/14 21:39, Julian Taylor wrote:
An issue is software emulation of real fma. This can be enabled in the test ufunc with npfma.set_type("libc"). This is unfortunately incredibly slow about a factor 300 on my machine without hardware fma. This means we either have a function that is fast on some platforms and slow on others but always gives the same result or we have a fast function that gives better results on some platforms. Given that we are not worth that what numpy currently provides I favor the latter.
Any opinions on whether this should go into numpy or maybe stay a third party ufunc?
My preference would be to initially add an "madd" intrinsic. This can be supported on all platforms and can be documented to permit the use of FMA where available.
A 'true' FMA intrinsic function should only be provided when hardware FMA support is available. Many of the more interesting applications of FMA depend on there only being a single rounding step and as such "FMA" should probably mean "a*b + c with only a single rounding".
Regards, Freddie.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
On Thu, Jan 9, 2014 at 3:50 PM, Frédéric Bastien nouiz@nouiz.org wrote:
Hi,
It happen frequently that NumPy isn't compiled with all instruction that is available where it run. For example in distro. So if the decision is made to use the fast version when we don't use the newer instruction, the user need a way to know that. So the library need a function/attribute to tell that.
As these instructions are very new runtime cpu feature detection is required. That way also distribution users get the fast code if their cpu supports it.
How hard would it be to provide the choise to the user? We could provide 2 functions like: fma_fast() fma_prec() (for precision)? Or this could be a parameter or a user configuration option like for the overflow/underflow error.
I like Freddie Witherden proposal to name the function madd which does not guarantee one rounding operation. This leaves the namespace open for a special fma function with that guarantee. It can use the libc fma function which is very slow sometimes but platform independent. This is assuming apple did not again take shortcuts like they did with their libc hypot implementation, can someone disassemble apple libc to check what they are doing for C99 fma? And it leaves users the possibility to use the faster madd function if they do not need the precision guarantee.
Another option would be a precision context manager which tells numpy which variant to use. This would also be useful for other code (like abs/hypot/abs2/sum/reciprocal sqrt) but probably it involves more work.
with numpy.precision_mode('fast'): ... # allow no fma, use fast hypot, fast sum, ignore overflow/invalid errors
with numpy.precision_mode('precise'): ... # require fma, use precise hypot, use exact summation (math.fsum) or at least kahan summation, full overflow/invalid checks etc
On Thu, Jan 9, 2014 at 9:43 AM, Freddie Witherden freddie@witherden.org wrote:
On 08/01/14 21:39, Julian Taylor wrote:
An issue is software emulation of real fma. This can be enabled in the test ufunc with npfma.set_type("libc"). This is unfortunately incredibly slow about a factor 300 on my machine without hardware fma. This means we either have a function that is fast on some platforms and slow on others but always gives the same result or we have a fast function that gives better results on some platforms. Given that we are not worth that what numpy currently provides I favor the latter.
Any opinions on whether this should go into numpy or maybe stay a third party ufunc?
My preference would be to initially add an "madd" intrinsic. This can be supported on all platforms and can be documented to permit the use of FMA where available.
A 'true' FMA intrinsic function should only be provided when hardware FMA support is available. Many of the more interesting applications of FMA depend on there only being a single rounding step and as such "FMA" should probably mean "a*b + c with only a single rounding".
Regards, Freddie.
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
On Thu, Jan 9, 2014 at 3:30 PM, Julian Taylor jtaylor.debian@googlemail.com wrote:
On Thu, Jan 9, 2014 at 3:50 PM, Frédéric Bastien nouiz@nouiz.org wrote:
How hard would it be to provide the choise to the user? We could provide 2 functions like: fma_fast() fma_prec() (for precision)? Or this could be a parameter or a user configuration option like for the overflow/underflow error.
I like Freddie Witherden proposal to name the function madd which does not guarantee one rounding operation. This leaves the namespace open for a special fma function with that guarantee. It can use the libc fma function which is very slow sometimes but platform independent. This is assuming apple did not again take shortcuts like they did with their libc hypot implementation, can someone disassemble apple libc to check what they are doing for C99 fma? And it leaves users the possibility to use the faster madd function if they do not need the precision guarantee.
If madd doesn't provide any rounding guarantees, then its only reason for existence is that it provides a fused a*b+c loop that better utilizes memory bandwidth, right? I'm guessing that speedwise it doesn't really matter whether you use the fancy AVX instructions or not, since even the naive implementation is memory bound  the advantage is just in the fusion?
Lack of loop fusion is obviously a major limitation of numpy, but it's a very general problem. I'm sceptical about whether we want to get into the business of adding functions whose only purpose is to provide prefused loops. After madd, what other operations should we provide like this? msub (a*bc)? add3 (a+b+c)? maddm (a*b+c*d)? mult3 (a*b*c)? How do we decide? Surely it's better to direct people who are hitting memory bottlenecks to much more powerful and general solutions to this problem, like numexpr/cython/numba/theano?
(OTOH the verison that gives rounding guarantees is obviously a unique new feature.)
n
Good questions where do we stop.
I think as you that the fma with guarantees is a good new feature. But if this is made available, people will want to use it for speed. Some people won't like to use another library or dependency. They won't like to have random speed up or slow down. So why not add the ma and fma and trace the line to the operation implemented on the CPU that have an fused version? That will make a sensible limit I think.
Anyway, we won't use it directly. This is just my taught.
Do you know if those instruction are automatically used by gcc if we use the good architecture parameter?
Fred
On Thu, Jan 9, 2014 at 12:07 PM, Nathaniel Smith njs@pobox.com wrote:
On Thu, Jan 9, 2014 at 3:30 PM, Julian Taylor jtaylor.debian@googlemail.com wrote:
On Thu, Jan 9, 2014 at 3:50 PM, Frédéric Bastien nouiz@nouiz.org wrote:
How hard would it be to provide the choise to the user? We could provide 2 functions like: fma_fast() fma_prec() (for precision)? Or this could be a parameter or a user configuration option like for the overflow/underflow error.
I like Freddie Witherden proposal to name the function madd which does not guarantee one rounding operation. This leaves the namespace open for a special fma function with that guarantee. It can use the libc fma function which is very slow sometimes but platform independent. This is assuming apple did not again take shortcuts like they did with their libc hypot implementation, can someone disassemble apple libc to check what they are doing for C99 fma? And it leaves users the possibility to use the faster madd function if they do not need the precision guarantee.
If madd doesn't provide any rounding guarantees, then its only reason for existence is that it provides a fused a*b+c loop that better utilizes memory bandwidth, right? I'm guessing that speedwise it doesn't really matter whether you use the fancy AVX instructions or not, since even the naive implementation is memory bound  the advantage is just in the fusion?
Lack of loop fusion is obviously a major limitation of numpy, but it's a very general problem. I'm sceptical about whether we want to get into the business of adding functions whose only purpose is to provide prefused loops. After madd, what other operations should we provide like this? msub (a*bc)? add3 (a+b+c)? maddm (a*b+c*d)? mult3 (a*b*c)? How do we decide? Surely it's better to direct people who are hitting memory bottlenecks to much more powerful and general solutions to this problem, like numexpr/cython/numba/theano?
(OTOH the verison that gives rounding guarantees is obviously a unique new feature.)
n _______________________________________________ NumPyDiscussion mailing list NumPyDiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
On 10.01.2014 01:49, Frédéric Bastien wrote:
Do you know if those instruction are automatically used by gcc if we use the good architecture parameter?
they are used if you enable ffpcontract=fast. Do not set it to `on` this is an alias to `off` due to the semantics of C. ffastmath enables in in gcc 4.7 and 4.8 but not in 4.9 but this might be a bug, I filed one a while ago.
Also you need to set the mfma or arch=bdver{1,2,3,4}. Its not part of mavx2 last I checked.
But there are not many places in numpy the compiler can use it, only dot comes to mind which goes over blas libraries in the high performance case.
On 8 January 2014 22:39, Julian Taylor jtaylor.debian@googlemail.comwrote:
As you can see even without real hardware support it is about 30% faster than inplace unblocked numpy due better use of memory bandwidth. Its even more than two times faster than unoptimized numpy.
I have an i5, and AVX crashes, even though it is supported by my CPU. Here are my timings:
SSE2:
In [24]: %timeit npfma.fma(a, b, c) 100000 loops, best of 3: 15 us per loop
In [28]: %timeit npfma.fma(a, b, c) 100 loops, best of 3: 2.36 ms per loop
In [29]: %timeit npfma.fms(a, b, c) 100 loops, best of 3: 2.36 ms per loop
In [31]: %timeit pure_numpy_fma(a, b, c) 100 loops, best of 3: 7.5 ms per loop
In [33]: %timeit pure_numpy_fma2(a, b, c) 100 loops, best of 3: 4.41 ms per loop
The model supports all the way to sse4_2
libc:
In [24]: %timeit npfma.fma(a, b, c) 1000 loops, best of 3: 883 us per loop
In [28]: %timeit npfma.fma(a, b, c) 10 loops, best of 3: 88.7 ms per loop
In [29]: %timeit npfma.fms(a, b, c) 10 loops, best of 3: 87.4 ms per loop
In [31]: %timeit pure_numpy_fma(a, b, c) 100 loops, best of 3: 7.94 ms per loop
In [33]: %timeit pure_numpy_fma2(a, b, c) 100 loops, best of 3: 3.03 ms per loop
If you have a machine capable of fma instructions give it a spin to see if you get similar or better results. Please verify the assembly (objdump d fma<suffix>.o) to check if the compiler properly used the machine fma.
Following the instructions in the readme, there is only one compiled file, npfma.so, but no .o.
/David.
On Thu, Jan 9, 2014 at 3:54 PM, Daπid davidmenhur@gmail.com wrote:
On 8 January 2014 22:39, Julian Taylor jtaylor.debian@googlemail.comwrote:
As you can see even without real hardware support it is about 30% faster than inplace unblocked numpy due better use of memory bandwidth. Its even more than two times faster than unoptimized numpy.
I have an i5, and AVX crashes, even though it is supported by my CPU.
I forgot about the 32 byte alignment avx (as it is used in this code) requires. I pushed a new version that takes care of it. It should now work with avx.
Following the instructions in the readme, there is only one compiled file, npfma.so, but no .o.
the .o files are in the build/ subfolder
participants (7)

Charles R Harris

Daπid

Freddie Witherden

Frédéric Bastien

Julian Taylor

Nathaniel Smith

Neal Becker