[Numpy-discussion] adding fused multiply and add to numpy

Daπid davidmenhur at gmail.com
Thu Jan 9 09:54:42 EST 2014


On 8 January 2014 22:39, Julian Taylor <jtaylor.debian at googlemail.com> wrote:

> As you can see, even without real hardware support it is about 30% faster
> than in-place unblocked numpy due to better use of memory bandwidth. It's
> even more than two times faster than unoptimized numpy.
>

I have an i5, and AVX crashes, even though it is supported by my CPU.
Here are my timings:

SSE2:

In [24]: %timeit npfma.fma(a, b, c)
100000 loops, best of 3: 15 us per loop

In [28]: %timeit npfma.fma(a, b, c)
100 loops, best of 3: 2.36 ms per loop

In [29]: %timeit npfma.fms(a, b, c)
100 loops, best of 3: 2.36 ms per loop

In [31]: %timeit pure_numpy_fma(a, b, c)
100 loops, best of 3: 7.5 ms per loop

In [33]: %timeit pure_numpy_fma2(a, b, c)
100 loops, best of 3: 4.41 ms per loop

My CPU model supports instruction sets all the way up to sse4_2.

libc:

In [24]: %timeit npfma.fma(a, b, c)
1000 loops, best of 3: 883 us per loop

In [28]: %timeit npfma.fma(a, b, c)
10 loops, best of 3: 88.7 ms per loop

In [29]: %timeit npfma.fms(a, b, c)
10 loops, best of 3: 87.4 ms per loop

In [31]: %timeit pure_numpy_fma(a, b, c)
100 loops, best of 3: 7.94 ms per loop

In [33]: %timeit pure_numpy_fma2(a, b, c)
100 loops, best of 3: 3.03 ms per loop
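
The two pure-NumPy baselines timed above are not defined in this message. A minimal sketch of what they might look like, assuming pure_numpy_fma is the plain expression ("unoptimized numpy" in the quoted text) and pure_numpy_fma2 is the in-place variant ("inplace unblocked numpy"); both bodies are guesses for illustration, not the code from the npfma package:

```python
import numpy as np


def pure_numpy_fma(a, b, c):
    # Assumed "unoptimized" baseline: a * b allocates a temporary array,
    # and adding c allocates a second array for the result.
    return a * b + c


def pure_numpy_fma2(a, b, c):
    # Assumed "in-place" baseline: reuses the memory of `a` for both steps.
    # Note that this overwrites the caller's `a`.
    a *= b
    a += c
    return a
```

Because the second version overwrites its first argument, repeated calls on the same array (as %timeit makes) keep changing the values in `a`; that should not affect the timing much, but it does matter when comparing results.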



> If you have a machine capable of fma instructions give it a spin to see
> if you get similar or better results. Please verify the assembly
> (objdump -d fma-<suffix>.o) to check if the compiler properly used the
> machine fma.
>

Following the instructions in the readme, I end up with only one compiled
file, npfma.so, and no .o files.
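
Since objdump -d also disassembles shared objects, one possible way to do the requested check without the .o files is to run it on npfma.so directly and look for FMA mnemonics. The sketch below is only an illustration: the helper names are hypothetical, and the regular expression assumes the usual x86 spellings (vfmadd..., vfmsub..., vfnmadd..., vfnmsub...), so it may miss or over-match in unusual cases.

```python
import re
import subprocess
from collections import Counter

# Assumed pattern for x86 FMA3/FMA4 mnemonics such as vfmadd231pd or vfnmsub132sd.
FMA_RE = re.compile(r"\bvfn?m(?:add|sub)\w*")


def count_fma_mnemonics(disassembly):
    # Count each FMA-looking mnemonic found in objdump's text output.
    return Counter(FMA_RE.findall(disassembly))


def disassemble(path):
    # Run `objdump -d` on an object file or shared library and return its output.
    result = subprocess.run(
        ["objdump", "-d", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Example usage (requires objdump on the PATH):
#   counts = count_fma_mnemonics(disassemble("npfma.so"))
#   print(counts if counts else "no FMA instructions found")
```

An empty result would suggest the compiler did not emit hardware FMA instructions for that build, which is what one would expect on a CPU without FMA support.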


/David.