Hi,

Is it possible to get access to versions of ufuncs like sin and cos compiled with the -ffast-math compiler switch? I recently noticed that my weave.inline code was much faster than my pure numpy code for some fairly simple operations, and realised after some fiddling around that this was due to that switch. The speed difference was enormous in my test example, and when I checked the accuracy, the errors over the range of values I'm interested in are not really significant.

If there's currently no way of using these faster versions of these functions in numpy, maybe it would be worth adding them as a feature in a future version?

Thanks,
Dan Goodman
On 29.11.2013 21:15, Dan Goodman wrote:
Can you show the code that is slow in numpy? Which version of gcc and libc are you using? With gcc 4.8 it uses the glibc 2.17 sin/cos with fast-math, so there should be no difference.
It might be useful for some purposes to add a sort of precision context that allows using faster but less accurate functions; hypot is another case where it would be useful. But it's probably a rather large change, and the applications for it in numpy are limited. E.g. the main advantages of -ffast-math are for vectorization and complex numbers. For the former, numpy cannot merge operations the way numexpr does, and the simple loops are already vectorized. For complex numbers, numpy already implements them as if #pragma STDC CX_LIMITED_RANGE were enabled (python does the same).
Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
In trying to write some simple code to demonstrate it, I realised it was weirdly more complicated than I thought. Previously I had been comparing numpy against weave on a complicated expression, namely:

`a*sin(2.0*freq*pi*t) + b + v*exp(-dt/tau) + (-a*sin(2.0*freq*pi*t) - b)*exp(-dt/tau)`

Here only a and v are arrays. Doing that with weave and no -ffast-math took approximately the same time as numpy, but with weave and -ffast-math it was about 30x faster. Since numpy and weave without -ffast-math took about the same time, I assumed it wasn't memory bound but due to -ffast-math. Here's the demo code (you might need to comment a couple of lines out if you want to actually run it, since it also tests a couple of things that depend on a library): http://bit.ly/IziH8H

However, when I did a simple example that just computed y=sin(x) for arrays x and y, I found that numpy and weave without -ffast-math took about the same time, but weave with -ffast-math was significantly slower than numpy! My take-home message from this: optimisation is weird. Could it be that -ffast-math and -O3 allow SSE instructions, and that there is some overhead to this that makes it worthwhile for a complex expression but not for a simple one? Here's the code for the simple example (doesn't have any dependencies): http://bit.ly/18wdCKY

For reference, I'm on a newish 64-bit Windows machine running 32-bit Python 2.7.3, gcc version 4.5.2, and numpy 1.8.0 installed from binaries.

Dan
On 01.12.2013 21:53, Dan Goodman wrote:
This should be the code:

```c
int N = _N;
for (int _idx = 0; _idx < N; _idx++) {
    double a = _array_neurongroup_a[_idx];
    double v = _array_neurongroup_v[_idx];
    double _v = a*sin(2.0*freq*pi*t) + b + v*exp(-dt/tau) +
                (-a*sin(2.0*freq*pi*t) - b)*exp(-dt/tau);
    v = _v;
    _array_neurongroup_v[_idx] = v;
}
```

Your sin and exp calls are loop invariants: they do not depend on the loop variable. This allows moving the expensive functions out of the loop, leaving only some simple arithmetic in the body. Unfortunately IEEE 754 floating point (gcc's default mode) does not allow this type of transformation: the operations are not associative, and you have special values to propagate, sticky exceptions to preserve, errno to set, etc. All this prevents gcc from doing it in its default mode. -ffast-math tells it to ignore all these things and just make it fast, so it will do the loop-invariant transformation in this case. Here, setting just -fno-math-errno, which disables setting errno as the C standard requires, seems to be enough.

In pure numpy you have to do these types of transformations yourself, as cpython has no optimizer that does this kind of loop-invariant optimization.
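The effect of the hoisting can be sketched in pure Python (the scalar parameters below are made-up values; the compiler performs the same transformation at the C level under -ffast-math). Note that algebraically the body also factors to `(a*s + b)*(1 - e) + v*e`:

```python
import math

# Hypothetical scalar parameters -- all loop invariants.
freq, t, b, dt, tau = 10.0, 0.123, 2.0, 0.001, 0.02

def naive(a, v):
    # One sin/exp pair recomputed per element, as in the original loop body.
    return [ai * math.sin(2.0 * freq * math.pi * t) + b
            + vi * math.exp(-dt / tau)
            + (-ai * math.sin(2.0 * freq * math.pi * t) - b) * math.exp(-dt / tau)
            for ai, vi in zip(a, v)]

def hoisted(a, v):
    # sin/exp evaluated once; only cheap arithmetic remains per element.
    s = math.sin(2.0 * freq * math.pi * t)
    e = math.exp(-dt / tau)
    return [(ai * s + b) * (1.0 - e) + vi * e for ai, vi in zip(a, v)]
```

Both versions compute the same values (up to rounding); only the per-element cost differs.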
On 01.12.2013 22:59, Dan Goodman wrote:
No, on my linux machine -ffast-math is a little faster: numpy: 311 ms, weave_slow: 291 ms, weave_fast: 262 ms.

Here is a pure numpy version of your calculation which only performs 3 times worse than weave:

```python
def timefunc_numpy2(a, v):
    ext = exp(-dt/tau)
    sit = sin(2.0*freq*pi*t)
    bs = 20000
    for i in range(0, N, bs):
        ab = a[i:i+bs]
        vb = v[i:i+bs]
        absit = ab*sit + b
        vb *= ext
        vb += absit
        vb -= absit*ext
```

It works by replacing temporaries with in-place operations and by blocking the operations to be more cache-friendly. Using numexpr should give you similar results.
Julian Taylor <jtaylor.debian <at> googlemail.com> writes:
Maybe something to do with my older version of gcc (4.5)?
I was working on something similar without the blocking and also got good results. Actually, your version with blocking doesn't give me as good performance on my machine: it's around 6x slower than weave. I tried different block sizes but couldn't improve much on that. Using this unblocked code:

```python
def timefunc_numpy_smart():
    _sin_term = sin(2.0*freq*pi*t)
    _exp_term = exp(-dt/tau)
    _a_term = (_sin_term - _sin_term*_exp_term)
    _v = v
    _v *= _exp_term
    _v += a*_a_term
    _v += -b*_exp_term + b
```

I got around 5x slower. Using numexpr 'dumbly' (i.e. just putting the expression in directly) was slower than the function above, but a hybrid between the two approaches worked well:

```python
def timefunc_numexpr_smart():
    _sin_term = sin(2.0*freq*pi*t)
    _exp_term = exp(-dt/tau)
    _a_term = (_sin_term - _sin_term*_exp_term)
    _const_term = -b*_exp_term + b
    v[:] = numexpr.evaluate('a*_a_term+v*_exp_term+_const_term')
    #numexpr.evaluate('a*_a_term+v*_exp_term+_const_term', out=v)
```

This was about 3.5x slower than weave. If I used the commented-out final line instead, it was only 1.5x slower than weave, but it also gave wrong results. I reported this as a bug in numexpr a long time ago, but I guess it hasn't been fixed yet (or maybe I didn't upgrade my version recently).

Dan
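The rearrangement underlying both functions can be checked with plain scalars: factor the two sin/exp terms into an `a` coefficient and a constant. A minimal sanity check (arbitrary values, pure stdlib):

```python
import math

def original(a, v, s, e, b):
    # Expression exactly as written in the weave/C loop,
    # with s = sin(2*freq*pi*t) and e = exp(-dt/tau) precomputed.
    return a*s + b + v*e + (-a*s - b)*e

def rearranged(a, v, s, e, b):
    # Dan's factored form: a*(s - s*e) + v*e + (b - b*e).
    a_term = s - s*e
    const_term = -b*e + b
    return a*a_term + v*e + const_term
```

Since the factoring is exact algebra, the two agree up to floating-point rounding for any inputs.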
On 12/2/13, 12:14 AM, Dan Goodman wrote:
Err, no, there have not been performance improvements in numexpr since 2.0 (that I am aware of). Maybe you are running on a multi-core machine now and seeing better speedup because of that? Also, your expressions are made of transcendental functions, so linking numexpr with MKL could accelerate computations a good deal too. -- Francesc Alted
participants (4)

- Dan Goodman
- Francesc Alted
- Julian Taylor
- Pauli Virtanen