[Numpy-discussion] adding fused multiply and add to numpy
njs at pobox.com
Thu Jan 9 12:07:00 EST 2014
On Thu, Jan 9, 2014 at 3:30 PM, Julian Taylor
<jtaylor.debian at googlemail.com> wrote:
> On Thu, Jan 9, 2014 at 3:50 PM, Frédéric Bastien <nouiz at nouiz.org> wrote:
>> How hard would it be to provide the choise to the user? We could
>> provide 2 functions like: fma_fast() fma_prec() (for precision)? Or
>> this could be a parameter or a user configuration option like for the
>> overflow/underflow error.
> I like Freddie Witherden proposal to name the function madd which does not
> guarantee one rounding operation. This leaves the namespace open for a
> special fma function with that guarantee. It can use the libc fma function
> which is very slow sometimes but platform independent. This is assuming
> apple did not again take shortcuts like they did with their libc hypot
> implementation, can someone disassemble apple libc to check what they are
> doing for C99 fma?
> And it leaves users the possibility to use the faster madd function if they
> do not need the precision guarantee.
If madd doesn't provide any rounding guarantees, then its only reason
for existence is that it provides a fused a*b+c loop that better
utilizes memory bandwidth, right? I'm guessing that speed-wise it
doesn't really matter whether you use the fancy AVX instructions or
not, since even the naive implementation is memory bound -- the
advantage is just in the fusion?
Lack of loop fusion is obviously a major limitation of numpy, but it's
a very general problem. I'm sceptical about whether we want to get
into the business of adding functions whose only purpose is to provide
pre-fused loops. After madd, what other operations should we provide
like this? msub (a*b-c)? add3 (a+b+c)? maddm (a*b+c*d)? mult3 (a*b*c)?
How do we decide? Surely it's better to direct people who are hitting
memory bottlenecks to much more powerful and general solutions to this
problem, like numexpr/cython/numba/theano?
(OTOH the verison that gives rounding guarantees is obviously a unique
More information about the NumPy-Discussion