Wed Jan 8 16:39:07 EST 2014

Hi,

AMD's Bulldozer and Intel's Haswell x86 CPUs now also support the fused
multiply-add (FMA) operation:

http://en.wikipedia.org/wiki/Multiply–accumulate_operation

This operation is interesting for numpy for two reasons:
- Only one rounding is involved in the two arithmetic operations, which
reduces rounding error.
- Two operations are done in one memory pass, which improves performance
when operations are bound by memory bandwidth, as is very common in
numpy.
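As a small illustration of the single-rounding point (my own sketch, not from the original post): a float32 FMA can be emulated by doing the arithmetic in float64, where the product of two float32 values is exact, and rounding only once at the end. The inputs below are chosen so the two-step computation rounds the residual away entirely:

```python
import numpy as np

# Inputs chosen so that a * b rounds away a small residual in float32.
a = np.float32(1 + 2**-12)
b = np.float32(1 + 2**-12)
c = np.float32(-(1 + 2**-11))

# Two roundings: once after the multiply, once after the add.
two_step = a * b + c

# One effective rounding: the float64 product is exact (24 + 24 < 53
# significand bits), so in this example only the cast back to float32
# can round.
fused = np.float32(np.float64(a) * np.float64(b) + np.float64(c))

print(two_step)  # 0.0 -- the residual 2**-24 was rounded away
print(fused)     # 5.9604645e-08 == 2**-24
```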

I have done some experiments using a custom ufunc:
https://github.com/juliantaylor/npufuncs

See the README.md on how to try it out. It requires a recent GCC
compiler, at least 4.7 I think.

It contains SSE2, AVX, FMA3 (AVX2), FMA4 and software emulation
variants. Edit the file to select which one to use. Note that if the CPU
does not support the selected instruction it will simply crash.
Only the latter three are real FMA operations; the SSE2 and AVX variants
just perform two regular operations in one loop.

My current machine only supports SSE2, so here are the timings for it:

In [25]: a = np.arange(500000.)
In [26]: b = np.arange(500000.)
In [27]: c = np.arange(500000.)

In [28]: %timeit npfma.fma(a, b, c)
100 loops, best of 3: 2.49 ms per loop

In [30]: def pure_numpy_fma(a,b,c):
....:     return a * b + c

In [31]: %timeit pure_numpy_fma(a, b, c)
100 loops, best of 3: 7.36 ms per loop

In [32]: def pure_numpy_fma2(a,b,c):
....:     tmp = a * b
....:     tmp += c
....:     return tmp

In [33]: %timeit pure_numpy_fma2(a, b, c)
100 loops, best of 3: 3.47 ms per loop

As you can see, even without real hardware support it is about 30%
faster than in-place unblocked numpy due to better use of memory
bandwidth, and more than two times faster than unoptimized numpy.
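For reference, the memory-bandwidth point can be pushed slightly further in pure numpy with explicit output buffers, which avoids allocating even the one temporary (a sketch of mine, not from the original post; `fma_out` is a made-up name):

```python
import numpy as np

def fma_out(a, b, c, out=None):
    # Multiply into the output buffer, then add in place:
    # still two memory passes over the data, but only one allocation
    # (and none at all if the caller supplies `out`).
    out = np.multiply(a, b, out=out)
    out += c
    return out

a = np.arange(500000.)
b = np.arange(500000.)
c = np.arange(500000.)
result = fma_out(a, b, c)
```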

If you have a machine capable of FMA instructions, give it a spin to see
if you get similar or better results. Please verify the assembly
(objdump -d fma-<suffix>.o) to check that the compiler actually emitted
the machine FMA instruction.

An issue is software emulation of real FMA. This can be enabled in the
test ufunc with npfma.set_type("libc").
Unfortunately this is incredibly slow, about a factor of 300 on my
machine without hardware FMA.
This means we can either have a function that is fast on some platforms
and slow on others but always gives the same result, or a fast function
that gives better results on some platforms.
Given that the latter is never worse than what numpy currently provides,
I favor it.

Any opinions on whether this should go into numpy or maybe stay a third
party ufunc?

Concerning the interface, one should probably add several variants
mirroring the FMA3 instruction set:
http://en.wikipedia.org/wiki/Multiply–accumulate_operation
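Spelled out in pure numpy, the reference semantics of those variants might look like this (the names mirror the FMA3 mnemonics and are only illustrative, not an existing or proposed numpy API):

```python
import numpy as np

# Illustrative reference semantics for the FMA3-style variants
# (function names are hypothetical):
def fmadd(a, b, c):
    return a * b + c        # vfmadd:  a*b + c

def fmsub(a, b, c):
    return a * b - c        # vfmsub:  a*b - c

def fnmadd(a, b, c):
    return -(a * b) + c     # vfnmadd: -(a*b) + c

def fnmsub(a, b, c):
    return -(a * b) - c     # vfnmsub: -(a*b) - c

a, b, c = np.arange(4.), np.arange(4.), np.ones(4)
print(fmadd(a, b, c))   # [ 1.  2.  5. 10.]
```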

Additionally there is fmaddsub (a0 * b0 + c0, a1 * b1 - c1), which can
be used for complex numbers, but it probably doesn't need an explicit
numpy interface.
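To see why the mixed add/sub shape fits complex arithmetic, here is complex multiplication written out componentwise (my own sketch): the real part needs a multiply-subtract and the imaginary part a multiply-add, which is exactly the per-lane pattern fmaddsub-style instructions provide.

```python
import numpy as np

def complex_mul(x, y):
    # Real part:      multiply, then subtract
    re = x.real * y.real - x.imag * y.imag
    # Imaginary part: multiply, then add
    im = x.real * y.imag + x.imag * y.real
    return re + 1j * im

x = np.array([1 + 2j, 3 - 1j])
y = np.array([2 - 1j, -1 + 4j])
print(complex_mul(x, y))   # matches x * y
```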
