numpy ufuncs and COREPY  any info?
hi all, has anyone already tried to compare using an ordinary numpy ufunc vs that one from corepy, first of all I mean the project http://socghop.appspot.com/student_project/show/google/gsoc2009/python/t1240...
It would be interesting to know what is speedup for (eg) vec ** 0.5 or (if it's possible  it isn't pure ufunc) numpy.dot(Matrix, vec). Or any another example.
dmitrey schrieb:
hi all, has anyone already tried to compare using an ordinary numpy ufunc vs that one from corepy, first of all I mean the project http://socghop.appspot.com/student_project/show/google/gsoc2009/python/t1240...
It would be interesting to know what is speedup for (eg) vec ** 0.5 or (if it's possible  it isn't pure ufunc) numpy.dot(Matrix, vec). Or any another example.
I have no experience with the mentioned CorePy, but recently I was playing around with accelerated ufuncs using Intels Math Kernel Library (MKL). These improvements are now part of the numexpr package http://code.google.com/p/numexpr/ Some remarks on possible speed improvements on recent Intel x86 processors. 1) basic arithmetic ufuncs (add, sub, mul, ...) in standard numpy are fast (SSE is used) and speed is limited by memory bandwidth. 2) the speed of many transcendental functions (exp, sin, cos, pow, ...) can be improved by _roughly_ a factor of five (single core) by using the MKL. Most of the improvements stem from using faster algorithms with a vectorized implementation. Note: the speed improvement depends on a _lot_ of other circumstances. 3) Improving performance by using multi cores is much more difficult. Only for sufficiently large (>1e5) arrays a significant speedup is possible. Where a speed gain is possible, the MKL uses several cores. Some experimentation showed that adding a few OpenMP constructs you could get a similar speedup with numpy. 4) numpy.dot uses optimized implementations.
Gregor
A Friday 22 May 2009 11:42:56 Gregor Thalhammer escrigué:
dmitrey schrieb:
hi all, has anyone already tried to compare using an ordinary numpy ufunc vs that one from corepy, first of all I mean the project http://socghop.appspot.com/student_project/show/google/gsoc2009/python/t1 24024628235
It would be interesting to know what is speedup for (eg) vec ** 0.5 or (if it's possible  it isn't pure ufunc) numpy.dot(Matrix, vec). Or any another example.
I have no experience with the mentioned CorePy, but recently I was playing around with accelerated ufuncs using Intels Math Kernel Library (MKL). These improvements are now part of the numexpr package http://code.google.com/p/numexpr/ Some remarks on possible speed improvements on recent Intel x86 processors.
 basic arithmetic ufuncs (add, sub, mul, ...) in standard numpy are
fast (SSE is used) and speed is limited by memory bandwidth. 2) the speed of many transcendental functions (exp, sin, cos, pow, ...) can be improved by _roughly_ a factor of five (single core) by using the MKL. Most of the improvements stem from using faster algorithms with a vectorized implementation. Note: the speed improvement depends on a _lot_ of other circumstances. 3) Improving performance by using multi cores is much more difficult. Only for sufficiently large (>1e5) arrays a significant speedup is possible. Where a speed gain is possible, the MKL uses several cores. Some experimentation showed that adding a few OpenMP constructs you could get a similar speedup with numpy. 4) numpy.dot uses optimized implementations.
Good points Gregor. However, I wouldn't say that improving performance by using multi cores is *that* difficult, but rather that multi cores can only be used efficiently *whenever* the memory bandwith is not a limitation. An example of this is the computation of transcendental functions, where, even using vectorized implementations, the computation speed is still CPUbounded in many cases. And you have experimented yourself very good speedups for these cases with your implementation of numexpr/MKL :)
Cheers,
Francesc Alted wrote:
A Friday 22 May 2009 11:42:56 Gregor Thalhammer escrigué:
dmitrey schrieb: 3) Improving performance by using multi cores is much more difficult. Only for sufficiently large (>1e5) arrays a significant speedup is possible. Where a speed gain is possible, the MKL uses several cores. Some experimentation showed that adding a few OpenMP constructs you could get a similar speedup with numpy. 4) numpy.dot uses optimized implementations.
Good points Gregor. However, I wouldn't say that improving performance by using multi cores is *that* difficult, but rather that multi cores can only be used efficiently *whenever* the memory bandwith is not a limitation. An example of this is the computation of transcendental functions, where, even using vectorized implementations, the computation speed is still CPUbounded in many cases. And you have experimented yourself very good speedups for these cases with your implementation of numexpr/MKL :)
Using multiple cores is pretty easy for elementwise ufuncs; no communication needs to occur and the work partitioning is trivial. And actually I've found with some initial testing that multiple cores does still help when you are memory bound. I don't fully understand why yet, though I have some ideas. One reason is multiple memory controllers due to multiple sockets (ie opteron). Another is that each thread is pulling memory from a different bank, utilizing more bandwidth than a single sequential thread could. However if that's the case, we could possibly come up with code for a single thread that achieves (nearly) the same additional throughput..
Andrew
A Friday 22 May 2009 13:59:17 Andrew Friedley escrigué:
Using multiple cores is pretty easy for elementwise ufuncs; no communication needs to occur and the work partitioning is trivial. And actually I've found with some initial testing that multiple cores does still help when you are memory bound. I don't fully understand why yet, though I have some ideas. One reason is multiple memory controllers due to multiple sockets (ie opteron).
Yeah. I think this must likely be the reason. If, as in your case, you have several independent paths from different processors to your data, then you can achieve speedups even if you are having a memory bound in a oneprocessor scenario.
Another is that each thread is pulling memory from a different bank, utilizing more bandwidth than a single sequential thread could. However if that's the case, we could possibly come up with code for a single thread that achieves (nearly) the same additional throughput..
Well, I don't think you can achieve important speedups in this case, but experimenting never hurts :)
Good luck!
(sending again)
Hi,
I'm the student doing the project. I have a blog here, which contains some initial performance numbers for a couple test ufuncs I did:
It's really too early yet to give definitive results though; GSoC officially starts in two days :) What I'm finding is that the existing ufuncs are already pretty fast; it appears right now that the main limitation is memory bandwidth. If that's really the case, the performance gains I'll get will be through cache tricks (nontemporal loads/stores), reducing memory accesses and using multiple cores to get more bandwidth.
Another alternative we've talked about, and I (more and more likely) may look into is composing multiple operations together into a single ufunc. Again the main idea being that memory accesses can be reduced/eliminated.
Andrew
dmitrey wrote:
hi all, has anyone already tried to compare using an ordinary numpy ufunc vs that one from corepy, first of all I mean the project http://socghop.appspot.com/student_project/show/google/gsoc2009/python/t1240...
It would be interesting to know what is speedup for (eg) vec ** 0.5 or (if it's possible  it isn't pure ufunc) numpy.dot(Matrix, vec). Or any another example. _______________________________________________ Numpydiscussion mailing list Numpydiscussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpydiscussion
A Friday 22 May 2009 13:52:46 Andrew Friedley escrigué:
(sending again)
Hi,
I'm the student doing the project. I have a blog here, which contains some initial performance numbers for a couple test ufuncs I did:
It's really too early yet to give definitive results though; GSoC officially starts in two days :) What I'm finding is that the existing ufuncs are already pretty fast; it appears right now that the main limitation is memory bandwidth. If that's really the case, the performance gains I'll get will be through cache tricks (nontemporal loads/stores), reducing memory accesses and using multiple cores to get more bandwidth.
Another alternative we've talked about, and I (more and more likely) may look into is composing multiple operations together into a single ufunc. Again the main idea being that memory accesses can be reduced/eliminated.
IMHO, composing multiple operations together is the most promising venue for leveraging current multicore systems.
Another interesting approach is to implement costly operations (from the point of view of CPU resources), namely, transcendental functions like sin, cos or tan, but also others like sqrt or pow) in a parallel way. If besides, you can combine this with vectorized versions of them (by using the well spread SSE2 instruction set, see [1] for an example), then you would be able to achieve really good results for sure (at least Intel did with its VML library ;)
[1] http://gruntthepeon.free.fr/ssemath/
Cheers,
For some reason the list seems to occasionally drop my messages...
Francesc Alted wrote:
A Friday 22 May 2009 13:52:46 Andrew Friedley escrigué:
I'm the student doing the project. I have a blog here, which contains some initial performance numbers for a couple test ufuncs I did:
Another alternative we've talked about, and I (more and more likely) may look into is composing multiple operations together into a single ufunc. Again the main idea being that memory accesses can be reduced/eliminated.
IMHO, composing multiple operations together is the most promising venue for leveraging current multicore systems.
Agreed  our concern when considering for the project was to keep the scope reasonable so I can complete it in the GSoC timeframe. If I have time I'll definitely be looking into this over the summer; if not later.
Another interesting approach is to implement costly operations (from the point of view of CPU resources), namely, transcendental functions like sin, cos or tan, but also others like sqrt or pow) in a parallel way. If besides, you can combine this with vectorized versions of them (by using the well spread SSE2 instruction set, see [1] for an example), then you would be able to achieve really good results for sure (at least Intel did with its VML library ;)
I've seen that page before. Using another source [1] I came up with a quick/dirty cos ufunc. Performance is crazy good compared to NumPy (100x); see the latest post on my blog for a little more info. I'll look at the source myself when I get time again, but is NumPy using a Pythonbased cos function, a C implementation, or something else? As I wrote in my blog, the performance gain is almost too good to believe.
[1] http://www.devmaster.net/forums/showthread.php?t=5784
Andrew
On Mon, May 25, 2009 at 4:59 AM, Andrew Friedley afriedle@indiana.eduwrote:
For some reason the list seems to occasionally drop my messages...
Francesc Alted wrote:
A Friday 22 May 2009 13:52:46 Andrew Friedley escrigué:
I'm the student doing the project. I have a blog here, which contains some initial performance numbers for a couple test ufuncs I did:
Another alternative we've talked about, and I (more and more likely) may look into is composing multiple operations together into a single ufunc. Again the main idea being that memory accesses can be
reduced/eliminated.
IMHO, composing multiple operations together is the most promising venue
for
leveraging current multicore systems.
Agreed  our concern when considering for the project was to keep the scope reasonable so I can complete it in the GSoC timeframe. If I have time I'll definitely be looking into this over the summer; if not later.
Another interesting approach is to implement costly operations (from the
point
of view of CPU resources), namely, transcendental functions like sin, cos
or
tan, but also others like sqrt or pow) in a parallel way. If besides,
you can
combine this with vectorized versions of them (by using the well spread
SSE2
instruction set, see [1] for an example), then you would be able to
achieve
really good results for sure (at least Intel did with its VML library ;)
I've seen that page before. Using another source [1] I came up with a quick/dirty cos ufunc. Performance is crazy good compared to NumPy (100x); see the latest post on my blog for a little more info. I'll look at the source myself when I get time again, but is NumPy using a Pythonbased cos function, a C implementation, or something else? As I wrote in my blog, the performance gain is almost too good to believe.
Numpy uses the C library version. If long double and float aren't available the double version is used with number conversions, but that shouldn't give a factor of 100x. Something else is going on.
Chuck
Charles R Harris wrote:
On Mon, May 25, 2009 at 4:59 AM, Andrew Friedley <afriedle@indiana.edu mailto:afriedle@indiana.edu> wrote:
For some reason the list seems to occasionally drop my messages... Francesc Alted wrote: > A Friday 22 May 2009 13:52:46 Andrew Friedley escrigué: >> I'm the student doing the project. I have a blog here, which contains >> some initial performance numbers for a couple test ufuncs I did: >> >> http://numcorepy.blogspot.com >> Another alternative we've talked about, and I (more and more likely) may >> look into is composing multiple operations together into a single ufunc. >> Again the main idea being that memory accesses can be reduced/eliminated. > > IMHO, composing multiple operations together is the most promising venue for > leveraging current multicore systems. Agreed  our concern when considering for the project was to keep the scope reasonable so I can complete it in the GSoC timeframe. If I have time I'll definitely be looking into this over the summer; if not later. > Another interesting approach is to implement costly operations (from the point > of view of CPU resources), namely, transcendental functions like sin, cos or > tan, but also others like sqrt or pow) in a parallel way. If besides, you can > combine this with vectorized versions of them (by using the well spread SSE2 > instruction set, see [1] for an example), then you would be able to achieve > really good results for sure (at least Intel did with its VML library ;) > > [1] http://gruntthepeon.free.fr/ssemath/ I've seen that page before. Using another source [1] I came up with a quick/dirty cos ufunc. Performance is crazy good compared to NumPy (100x); see the latest post on my blog for a little more info. I'll look at the source myself when I get time again, but is NumPy using a Pythonbased cos function, a C implementation, or something else? As I wrote in my blog, the performance gain is almost too good to believe.
Numpy uses the C library version. If long double and float aren't available the double version is used with number conversions, but that shouldn't give a factor of 100x. Something else is going on.
I think something is wrong with the measurement method  on my machine, computing the cos of an array of double takes roughly ~400 cycles/item for arrays with a reasonable size (> 1e3 items). Taking 4 cycles/item for cos would be very impressive :)
David
A Tuesday 26 May 2009 03:11:56 David Cournapeau escrigué:
Charles R Harris wrote:
On Mon, May 25, 2009 at 4:59 AM, Andrew Friedley <afriedle@indiana.edu mailto:afriedle@indiana.edu> wrote:
For some reason the list seems to occasionally drop my messages... Francesc Alted wrote: > A Friday 22 May 2009 13:52:46 Andrew Friedley escrigué: >> I'm the student doing the project. I have a blog here, which contains >> some initial performance numbers for a couple test ufuncs I did: >> >> http://numcorepy.blogspot.com >> >> Another alternative we've talked about, and I (more and more likely) may >> look into is composing multiple operations together into a single ufunc. >> Again the main idea being that memory accesses can be reduced/eliminated. > IMHO, composing multiple operations together is the most promising venue for > leveraging current multicore systems. Agreed  our concern when considering for the project was to keep
the scope reasonable so I can complete it in the GSoC timeframe. If I have time I'll definitely be looking into this over the summer; if not later.
> Another interesting approach is to implement costly operations (from the point > of view of CPU resources), namely, transcendental functions like sin, cos or > tan, but also others like sqrt or pow) in a parallel way. If besides, you can > combine this with vectorized versions of them (by using the well spread SSE2 > instruction set, see [1] for an example), then you would be able to achieve > really good results for sure (at least Intel did with its VML library ;) > [1] http://gruntthepeon.free.fr/ssemath/ I've seen that page before. Using another source [1] I came up with
a quick/dirty cos ufunc. Performance is crazy good compared to NumPy (100x); see the latest post on my blog for a little more info. I'll look at the source myself when I get time again, but is NumPy using a Pythonbased cos function, a C implementation, or something else? As I wrote in my blog, the performance gain is almost too good to believe.
Numpy uses the C library version. If long double and float aren't available the double version is used with number conversions, but that shouldn't give a factor of 100x. Something else is going on.
I think something is wrong with the measurement method  on my machine, computing the cos of an array of double takes roughly ~400 cycles/item for arrays with a reasonable size (> 1e3 items). Taking 4 cycles/item for cos would be very impressive :)
Well, it is Andrew who should demonstrate that his measurement is correct, but in principle, 4 cycles/item *should* be feasible when using 8 cores in parallel. In [1] one can see how Intel achieves (with his VML kernel) to compute a cos() in less than 23 cycles in one single core. Having 8 cores in parallel would allow, in theory, reach 3 cycles/item.
[1]http://www.intel.com/software/products/mkl/data/vml/functions/_performanceal...
Francesc Alted wrote:
Well, it is Andrew who should demonstrate that his measurement is correct, but in principle, 4 cycles/item *should* be feasible when using 8 cores in parallel.
But the 100x speed increase is for one core only unless I misread the table. And I should have mentioned that 400 cycles/item for cos is on a pentium 4, which has dreadful performances (defective L1). On a much better core duo extreme something, I get 100 cycles / item (on a 64 bits machines, though, and not same compiler, although I guess the libm version is what matters the most here).
And let's not forget that there is the python wrapping cost: by doing everything in C, I got ~ 200 cycle/cos on the PIV, and ~60 cycles/cos on the core 2 duo (for double), using the rdtsc performance counter. All this for 1024 items in the array, so very optimistic usecase (everything in cache 2 if not 1).
This shows that python wrapping cost is not so high, making the 100x claim a bit doubtful without more details on the way to measure speed.
cheers,
David
David Cournapeau wrote:
Francesc Alted wrote:
Well, it is Andrew who should demonstrate that his measurement is correct, but in principle, 4 cycles/item *should* be feasible when using 8 cores in parallel.
But the 100x speed increase is for one core only unless I misread the table. And I should have mentioned that 400 cycles/item for cos is on a pentium 4, which has dreadful performances (defective L1). On a much better core duo extreme something, I get 100 cycles / item (on a 64 bits machines, though, and not same compiler, although I guess the libm version is what matters the most here).
And let's not forget that there is the python wrapping cost: by doing everything in C, I got ~ 200 cycle/cos on the PIV, and ~60 cycles/cos on the core 2 duo (for double), using the rdtsc performance counter. All this for 1024 items in the array, so very optimistic usecase (everything in cache 2 if not 1).
This shows that python wrapping cost is not so high, making the 100x claim a bit doubtful without more details on the way to measure speed.
I appreciate all the discussion this is creating. I wish I could work on this more right now; I have a big paper deadline coming up June 1 that I need to focus on.
Yes, you're reading the table right. I should have been more clear on what my implementation is doing. It's using SIMD, so performing 4 cosine's at a time where a libm cosine is only doing one. Also I don't think libm trancendentals are known for being fast; I'm also likely gaining performance by using a welloptimized but less accurate approximation. In fact a little more inspection shows my accuracy decreases as the input values increase; I will probably need to take a performance hit to fix this.
I went and wrote code to use the libm fcos() routine instead of my cos code. Performance is equivalent to numpy, plus an overhead:
inp sizes 1024 10240 102400 1024000 3072000 numpy 0.7282 9.6278 115.5976 993.5738 3017.3680
lmcos 1 0.7594 9.7579 116.7135 1039.5783 3156.8371 lmcos 2 0.5274 5.7885 61.8052 537.8451 1576.2057 lmcos 4 0.5172 5.1240 40.5018 313.2487 791.9730
corepy 1 0.0142 0.0880 0.9566 9.6162 28.4972 corepy 2 0.0342 0.0754 0.6991 6.1647 15.3545 corepy 4 0.0596 0.0963 0.5671 4.9499 13.8784
The times I show are in milliseconds; the system used is a dualsocket dualcore 2ghz opteron. I'm testing at the ufunc level, like this:
def benchmark(fn, args): avgtime = 0 fn(*args)
for i in xrange(7): t1 = time.time() fn(*args) t2 = time.time()
tm = t2  t1 avgtime += tm
return avgtime / 7
Where fn is a ufunc, ie numpy.cos. So I prime the execution once, then do 7 timings and take the average. I always appreciate suggestions on better way to benchmark things.
Andrew
A Tuesday 26 May 2009 15:14:39 Andrew Friedley escrigué:
David Cournapeau wrote:
Francesc Alted wrote:
Well, it is Andrew who should demonstrate that his measurement is correct, but in principle, 4 cycles/item *should* be feasible when using 8 cores in parallel.
But the 100x speed increase is for one core only unless I misread the table. And I should have mentioned that 400 cycles/item for cos is on a pentium 4, which has dreadful performances (defective L1). On a much better core duo extreme something, I get 100 cycles / item (on a 64 bits machines, though, and not same compiler, although I guess the libm version is what matters the most here).
And let's not forget that there is the python wrapping cost: by doing everything in C, I got ~ 200 cycle/cos on the PIV, and ~60 cycles/cos on the core 2 duo (for double), using the rdtsc performance counter. All this for 1024 items in the array, so very optimistic usecase (everything in cache 2 if not 1).
This shows that python wrapping cost is not so high, making the 100x claim a bit doubtful without more details on the way to measure speed.
I appreciate all the discussion this is creating. I wish I could work on this more right now; I have a big paper deadline coming up June 1 that I need to focus on.
Yes, you're reading the table right. I should have been more clear on what my implementation is doing. It's using SIMD, so performing 4 cosine's at a time where a libm cosine is only doing one. Also I don't think libm trancendentals are known for being fast; I'm also likely gaining performance by using a welloptimized but less accurate approximation. In fact a little more inspection shows my accuracy decreases as the input values increase; I will probably need to take a performance hit to fix this.
I went and wrote code to use the libm fcos() routine instead of my cos code. Performance is equivalent to numpy, plus an overhead:
inp sizes 1024 10240 102400 1024000 3072000 numpy 0.7282 9.6278 115.5976 993.5738 3017.3680
lmcos 1 0.7594 9.7579 116.7135 1039.5783 3156.8371 lmcos 2 0.5274 5.7885 61.8052 537.8451 1576.2057 lmcos 4 0.5172 5.1240 40.5018 313.2487 791.9730
corepy 1 0.0142 0.0880 0.9566 9.6162 28.4972 corepy 2 0.0342 0.0754 0.6991 6.1647 15.3545 corepy 4 0.0596 0.0963 0.5671 4.9499 13.8784
The times I show are in milliseconds; the system used is a dualsocket dualcore 2ghz opteron. I'm testing at the ufunc level, like this:
def benchmark(fn, args): avgtime = 0 fn(*args)
for i in xrange(7): t1 = time.time() fn(*args) t2 = time.time()
tm = t2  t1 avgtime += tm
return avgtime / 7
Where fn is a ufunc, ie numpy.cos. So I prime the execution once, then do 7 timings and take the average. I always appreciate suggestions on better way to benchmark things.
No, that seems good enough. But maybe you can present results in cycles/item. This is a relatively common unit and has the advantage that it does not depend on the frequency of your cores.
Francesc Alted wrote:
A Tuesday 26 May 2009 15:14:39 Andrew Friedley escrigué:
David Cournapeau wrote:
Francesc Alted wrote:
Well, it is Andrew who should demonstrate that his measurement is correct, but in principle, 4 cycles/item *should* be feasible when using 8 cores in parallel.
But the 100x speed increase is for one core only unless I misread the table. And I should have mentioned that 400 cycles/item for cos is on a pentium 4, which has dreadful performances (defective L1). On a much better core duo extreme something, I get 100 cycles / item (on a 64 bits machines, though, and not same compiler, although I guess the libm version is what matters the most here).
And let's not forget that there is the python wrapping cost: by doing everything in C, I got ~ 200 cycle/cos on the PIV, and ~60 cycles/cos on the core 2 duo (for double), using the rdtsc performance counter. All this for 1024 items in the array, so very optimistic usecase (everything in cache 2 if not 1).
This shows that python wrapping cost is not so high, making the 100x claim a bit doubtful without more details on the way to measure speed.
I appreciate all the discussion this is creating. I wish I could work on this more right now; I have a big paper deadline coming up June 1 that I need to focus on.
Yes, you're reading the table right. I should have been more clear on what my implementation is doing. It's using SIMD, so performing 4 cosine's at a time where a libm cosine is only doing one. Also I don't think libm trancendentals are known for being fast; I'm also likely gaining performance by using a welloptimized but less accurate approximation. In fact a little more inspection shows my accuracy decreases as the input values increase; I will probably need to take a performance hit to fix this.
I went and wrote code to use the libm fcos() routine instead of my cos code. Performance is equivalent to numpy, plus an overhead:
inp sizes 1024 10240 102400 1024000 3072000 numpy 0.7282 9.6278 115.5976 993.5738 3017.3680
lmcos 1 0.7594 9.7579 116.7135 1039.5783 3156.8371 lmcos 2 0.5274 5.7885 61.8052 537.8451 1576.2057 lmcos 4 0.5172 5.1240 40.5018 313.2487 791.9730
corepy 1 0.0142 0.0880 0.9566 9.6162 28.4972 corepy 2 0.0342 0.0754 0.6991 6.1647 15.3545 corepy 4 0.0596 0.0963 0.5671 4.9499 13.8784
The times I show are in milliseconds; the system used is a dualsocket dualcore 2ghz opteron. I'm testing at the ufunc level, like this:
def benchmark(fn, args): avgtime = 0 fn(*args)
for i in xrange(7): t1 = time.time() fn(*args) t2 = time.time()
tm = t2  t1 avgtime += tm
return avgtime / 7
Where fn is a ufunc, ie numpy.cos. So I prime the execution once, then do 7 timings and take the average. I always appreciate suggestions on better way to benchmark things.
No, that seems good enough. But maybe you can present results in cycles/item. This is a relatively common unit and has the advantage that it does not depend on the frequency of your cores.
(it seems that I do not receive all emails  I never get the emails from Andrew ?)
Concerning the timing: I think generally, you should report the minimum, not the average. The numbers for numpy are strange: 3s to compute 3e6 cos on a 2Ghz core duo (~2000 cycles/item) is very slow. In that sense, taking 20 cycles/item for your optimized version is much more believable, though :)
I know the usual libm functions are not super fast, specially if high accuracy is not needed. Music softwares and games usually go away with approximations which are quite fast (.e.g using cos+sin evaluation at the same time), but those are generally unacceptable for scientific usage. I think it is critical to always check the result of your implementation, because getting something fast but wrong can waste a lot of your time :) One thing which may be hard to do is correct nan/inf handling. I don't know how SIMD extensions handle this.
cheers,
David
David Cournapeau wrote:
Francesc Alted wrote:
No, that seems good enough. But maybe you can present results in cycles/item. This is a relatively common unit and has the advantage that it does not depend on the frequency of your cores.
Sure, cycles is fine, but I'll argue that in this case the number still does depend on the frequency of the cores, particularly as it relates to the frequency of the memory bus/controllers. A processor with a higher clock rate and higher multiplier may show lower performance when measuring in cycles because the memory bandwidth has not necessarily increased, only the CPU clock rate. Plus between say a xeon and opteron you will have different SSE performance characteristics. So really, any sole number/unit is not sufficient without also describing the system it was obtained on :)
(it seems that I do not receive all emails  I never get the emails from Andrew ?)
I seem to have issues with my emails just disappearing; sometimes they never appear on the list and I have to resend them.
Concerning the timing: I think generally, you should report the minimum, not the average. The numbers for numpy are strange: 3s to compute 3e6 cos on a 2Ghz core duo (~2000 cycles/item) is very slow. In that sense, taking 20 cycles/item for your optimized version is much more believable, though :)
I can do minimum. My motivation for average was to show a commoncase performance an application might see. If that application executes the ufunc many times, the performance will tend towards the average.
I know the usual libm functions are not super fast, specially if high accuracy is not needed. Music softwares and games usually go away with approximations which are quite fast (.e.g using cos+sin evaluation at the same time), but those are generally unacceptable for scientific usage. I think it is critical to always check the result of your implementation, because getting something fast but wrong can waste a lot of your time :) One thing which may be hard to do is correct nan/inf handling. I don't know how SIMD extensions handle this.
I was waiting for someone to bring this up :) I used an implementation that I'm now thinking is not accurate enough for scientific use. But the question is, what is a concrete measure for determining whether some cosine (or other function) implementation is accurate enough? I guess we have precedent in the form of libm's implementation/accuracy tradeoffs, but is that precedent correct?
Really answering that question, and coming up with the best possible implementations that meet the requirements, is probably a GSoC project on its own.
Andrew
Andrew Friedley wrote:
David Cournapeau wrote:
Francesc Alted wrote:
No, that seems good enough. But maybe you can present results in cycles/item. This is a relatively common unit and has the advantage that it does not depend on the frequency of your cores.
Sure, cycles is fine, but I'll argue that in this case the number still does depend on the frequency of the cores, particularly as it relates to the frequency of the memory bus/controllers. A processor with a higher clock rate and higher multiplier may show lower performance when measuring in cycles because the memory bandwidth has not necessarily increased, only the CPU clock rate. Plus between say a xeon and opteron you will have different SSE performance characteristics. So really, any sole number/unit is not sufficient without also describing the system it was obtained on :)
Yes, that's why people usually add the CPU type with the cycles/operation count :) It makes comparison easier. Sure, the comparison is not accurate because differences in CPU may make a difference. But with cycles/computation, we could see right away that something was strange with the numpy timing, so I think it is a better representation for discussion/comoparison.
I can do minimum. My motivation for average was to show a commoncase performance an application might see. If that application executes the ufunc many times, the performance will tend towards the average.
The rationale for minimum is to remove external factors like other tasks taking CPU, etc...
I was waiting for someone to bring this up :) I used an implementation that I'm now thinking is not accurate enough for scientific use. But the question is, what is a concrete measure for determining whether some cosine (or other function) implementation is accurate enough?
Nan/inf/zero handling should be tested for every function (the exact behavior for standard functions is part of the C standard), and then, the particular values depend on the function and implementation. If your implementation has several codepath, each codepath should be tested. But really, most implementations just test for a few more or less random known values. I know the GNU libc has some tests for the math library, for example.
For single precision, brute force testing against a reference implementation for every possible input is actually feasible, too :)
David
A Monday 25 May 2009 12:59:31 Andrew Friedley escrigué:
For some reason the list seems to occasionally drop my messages...
Francesc Alted wrote:
A Friday 22 May 2009 13:52:46 Andrew Friedley escrigué:
I'm the student doing the project. I have a blog here, which contains some initial performance numbers for a couple test ufuncs I did:
Another alternative we've talked about, and I (more and more likely) may look into is composing multiple operations together into a single ufunc. Again the main idea being that memory accesses can be reduced/eliminated.
IMHO, composing multiple operations together is the most promising venue for leveraging current multicore systems.
Agreed  our concern when considering for the project was to keep the scope reasonable so I can complete it in the GSoC timeframe. If I have time I'll definitely be looking into this over the summer; if not later.
You should know that Numexpr has already started this path for some time now. The fact that it already can evaluate complex array expressions like 'a+b*cos(c)' without using temporaries (like NumPy does) should allow it to use multiple cores without stressing the memory bus too much. I'm planning to implement such parallelism in Numexpr for some time now, but not there yet.
I've seen that page before. Using another source [1] I came up with a quick/dirty cos ufunc. Performance is crazy good compared to NumPy (100x); see the latest post on my blog for a little more info. I'll look at the source myself when I get time again, but is NumPy using a Pythonbased cos function, a C implementation, or something else? As I wrote in my blog, the performance gain is almost too good to believe.
100x? Uh, sounds really impressing...
participants (6)

Andrew Friedley

Charles R Harris

David Cournapeau

dmitrey

Francesc Alted

Gregor Thalhammer