[Numpy-discussion] adding fused multiply and add to numpy

Frédéric Bastien nouiz at nouiz.org
Thu Jan 9 19:49:00 EST 2014

Good questions where do we stop.

I think as you that the fma with guarantees is a good new feature. But
if this is made available, people will want to use it for speed. Some
people won't like to use another library or dependency. They won't
like to have random speed up or slow down. So why not add the ma and
fma and trace the line to the operation implemented on the CPU that
have an fused version? That will make a sensible limit I think.

Anyway, we won't use it directly. This is just my taught.

Do you know if those instruction are automatically used by gcc if we
use the good architecture parameter?


On Thu, Jan 9, 2014 at 12:07 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Thu, Jan 9, 2014 at 3:30 PM, Julian Taylor
> <jtaylor.debian at googlemail.com> wrote:
>> On Thu, Jan 9, 2014 at 3:50 PM, Frédéric Bastien <nouiz at nouiz.org> wrote:
>>> How hard would it be to provide the choise to the user? We could
>>> provide 2 functions like: fma_fast() fma_prec() (for precision)? Or
>>> this could be a parameter or a user configuration option like for the
>>> overflow/underflow error.
>> I like Freddie Witherden proposal to name the function madd which does not
>> guarantee one rounding operation. This leaves the namespace open for a
>> special fma function with that guarantee. It can use the libc fma function
>> which is very slow sometimes but platform independent. This is assuming
>> apple did not again take shortcuts like they did with their libc hypot
>> implementation, can someone disassemble apple libc to check what they are
>> doing for C99 fma?
>> And it leaves users the possibility to use the faster madd function if they
>> do not need the precision guarantee.
> If madd doesn't provide any rounding guarantees, then its only reason
> for existence is that it provides a fused a*b+c loop that better
> utilizes memory bandwidth, right? I'm guessing that speed-wise it
> doesn't really matter whether you use the fancy AVX instructions or
> not, since even the naive implementation is memory bound -- the
> advantage is just in the fusion?
> Lack of loop fusion is obviously a major limitation of numpy, but it's
> a very general problem. I'm sceptical about whether we want to get
> into the business of adding functions whose only purpose is to provide
> pre-fused loops. After madd, what other operations should we provide
> like this? msub (a*b-c)? add3 (a+b+c)? maddm (a*b+c*d)? mult3 (a*b*c)?
> How do we decide? Surely it's better to direct people who are hitting
> memory bottlenecks to much more powerful and general solutions to this
> problem, like numexpr/cython/numba/theano?
> (OTOH the verison that gives rounding guarantees is obviously a unique
> new feature.)
> -n
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

More information about the NumPy-Discussion mailing list