David Cournapeau wrote:
Francesc Alted wrote:
No, that seems good enough. But maybe you can present results in cycles/item. This is a relatively common unit and has the advantage that it does not depend on the frequency of your cores.
Sure, cycles is fine, but I'll argue that in this case the number still does depend on the frequency of the cores, particularly as it relates to the frequency of the memory bus/controllers. A processor with a higher clock rate and higher multiplier may show lower performance when measuring in cycles because the memory bandwidth has not necessarily increased, only the CPU clock rate. Plus between say a xeon and opteron you will have different SSE performance characteristics. So really, any sole number/unit is not sufficient without also describing the system it was obtained on :)
(it seems that I do not receive all emails - I never get the emails from Andrew ?)
I seem to have issues with my emails just disappearing; sometimes they never appear on the list and I have to re-send them.
Concerning the timing: I think generally, you should report the minimum, not the average. The numbers for numpy are strange: 3s to compute 3e6 cos on a 2Ghz core duo (~2000 cycles/item) is very slow. In that sense, taking 20 cycles/item for your optimized version is much more believable, though :)
I can do minimum. My motivation for average was to show a common-case performance an application might see. If that application executes the ufunc many times, the performance will tend towards the average.
I know the usual libm functions are not super fast, specially if high accuracy is not needed. Music softwares and games usually go away with approximations which are quite fast (.e.g using cos+sin evaluation at the same time), but those are generally unacceptable for scientific usage. I think it is critical to always check the result of your implementation, because getting something fast but wrong can waste a lot of your time :) One thing which may be hard to do is correct nan/inf handling. I don't know how SIMD extensions handle this.
I was waiting for someone to bring this up :) I used an implementation that I'm now thinking is not accurate enough for scientific use. But the question is, what is a concrete measure for determining whether some cosine (or other function) implementation is accurate enough? I guess we have precedent in the form of libm's implementation/accuracy tradeoffs, but is that precedent correct? Really answering that question, and coming up with the best possible implementations that meet the requirements, is probably a GSoC project on its own. Andrew