[Numpy-discussion] adding fused multiply and add to numpy
Daπid
davidmenhur at gmail.com
Thu Jan 9 09:54:42 EST 2014
On 8 January 2014 22:39, Julian Taylor <jtaylor.debian at googlemail.com> wrote:
> As you can see, even without real hardware support it is about 30% faster
> than inplace unblocked numpy due to better use of memory bandwidth. It's
> even more than two times faster than unoptimized numpy.
>
I have an i5, and AVX crashes, even though it is supported by my CPU.
Here are my timings:
SSE2:
In [24]: %timeit npfma.fma(a, b, c)
100000 loops, best of 3: 15 us per loop
In [28]: %timeit npfma.fma(a, b, c)
100 loops, best of 3: 2.36 ms per loop
In [29]: %timeit npfma.fms(a, b, c)
100 loops, best of 3: 2.36 ms per loop
In [31]: %timeit pure_numpy_fma(a, b, c)
100 loops, best of 3: 7.5 ms per loop
In [33]: %timeit pure_numpy_fma2(a, b, c)
100 loops, best of 3: 4.41 ms per loop
My CPU model supports the SSE family all the way up to sse4_2.
libc:
In [24]: %timeit npfma.fma(a, b, c)
1000 loops, best of 3: 883 us per loop
In [28]: %timeit npfma.fma(a, b, c)
10 loops, best of 3: 88.7 ms per loop
In [29]: %timeit npfma.fms(a, b, c)
10 loops, best of 3: 87.4 ms per loop
In [31]: %timeit pure_numpy_fma(a, b, c)
100 loops, best of 3: 7.94 ms per loop
In [33]: %timeit pure_numpy_fma2(a, b, c)
100 loops, best of 3: 3.03 ms per loop
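For anyone reading along: pure_numpy_fma and pure_numpy_fma2 are not defined in this message. Going by Julian's description ("unoptimized numpy" and "inplace unblocked numpy"), a minimal sketch of what such baselines could look like is below. The function names and bodies are my assumption, not the benchmark's actual code, and neither is a true fused operation, since the product is rounded before the addition.

```python
import numpy as np


def unoptimized_fma_sketch(a, b, c):
    # Hypothetical stand-in for an "unoptimized numpy" baseline:
    # a * b allocates one temporary, and "+ c" allocates a second array.
    return a * b + c


def inplace_fma_sketch(a, b, c):
    # Hypothetical stand-in for an "inplace unblocked numpy" baseline:
    # one temporary for the product, then the addition reuses that buffer.
    out = a * b
    out += c
    return out


a = np.arange(6, dtype=np.float64)
b = np.full(6, 2.0)
c = np.ones(6)
result_unoptimized = unoptimized_fma_sketch(a, b, c)
result_inplace = inplace_fma_sketch(a, b, c)
```

The in-place variant touches less memory, which would be consistent with it being the faster of the two pure numpy timings, but that is only a guess about what the benchmark measures.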
> If you have a machine capable of fma instructions give it a spin to see
> if you get similar or better results. Please verify the assembly
> (objdump -d fma-<suffix>.o) to check if the compiler properly used the
> machine fma.
>
After following the instructions in the README, I get only one compiled file,
npfma.so, and no .o files.
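Since objdump -d also works on shared objects, one option would be to disassemble npfma.so directly and look for FMA mnemonics (the vfmadd/vfmsub/vfnmadd/vfnmsub families). A rough sketch of that check follows. The helper names are hypothetical, and it assumes objdump is on the PATH and npfma.so is in the current directory.

```python
import re
import subprocess

# Matches x86 FMA mnemonics such as vfmadd231pd, vfmsub132ps, vfnmadd213sd.
FMA_MNEMONIC = re.compile(r"\bvfn?m(?:add|sub)\w*")


def count_fma_mnemonics(disassembly):
    # Count each FMA mnemonic that appears in objdump-style text.
    counts = {}
    for match in FMA_MNEMONIC.finditer(disassembly):
        name = match.group(0)
        counts[name] = counts.get(name, 0) + 1
    return counts


def disassemble(path):
    # Run "objdump -d" on the given file and return its output as text.
    completed = subprocess.run(
        ["objdump", "-d", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout


# Example usage (not executed here):
#   print(count_fma_mnemonics(disassemble("npfma.so")))
```

An empty result would suggest the compiler did not emit machine fma in that build, although this only inspects mnemonics and says nothing about which code path is selected at runtime.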
/David.