Francesc Alted wrote:
On Tuesday 26 May 2009 15:14:39, Andrew Friedley wrote:
David Cournapeau wrote:
Francesc Alted wrote:
Well, it is Andrew who should demonstrate that his measurement is correct, but in principle, 4 cycles/item *should* be feasible when using 8 cores in parallel.
But the 100x speed increase is for one core only, unless I misread the table. And I should have mentioned that the 400 cycles/item for cos is on a Pentium 4, which has dreadful performance (defective L1). On a much better Core 2 Duo Extreme (or some such), I get 100 cycles/item (on a 64-bit machine, though, and not the same compiler, although I guess the libm version is what matters most here).
And let's not forget that there is the Python wrapping cost: by doing everything in C, I got ~200 cycles/cos on the Pentium 4 and ~60 cycles/cos on the Core 2 Duo (for double), using the rdtsc performance counter. All this was for 1024 items in the array, so a very optimistic use case (everything in L2 cache, if not L1).
This shows that the Python wrapping cost is not so high, which makes the 100x claim a bit doubtful without more details on how the speed was measured.
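As a rough illustration of the wrapping-cost point (a hypothetical Python-only sketch, not David's rdtsc measurement, which was done in C): the fixed per-call overhead of a ufunc can be estimated by timing a tiny array, where the overhead dominates, against a larger one and fitting out the constant term. The helper name and array sizes below are made up for illustration.

import time
import numpy as np

def per_call_overhead(fn, small=8, large=1024, reps=1000):
    # Estimate the fixed per-call (Python wrapping) cost of a ufunc by
    # assuming time(N) ~= overhead + per_item * N and solving from two sizes.
    def best(a):
        times = []
        for _ in range(reps):
            t0 = time.time()
            fn(a)
            times.append(time.time() - t0)
        return min(times)
    t_small = best(np.ones(small))
    t_large = best(np.ones(large))
    per_item = (t_large - t_small) / (large - small)
    return t_small - per_item * small   # seconds of fixed overhead per call

print(per_call_overhead(np.cos))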
I appreciate all the discussion this is creating. I wish I could work on this more right now; I have a big paper deadline coming up June 1 that I need to focus on.
Yes, you're reading the table right. I should have been clearer about what my implementation is doing. It's using SIMD, so it performs 4 cosines at a time where a libm cosine does only one. Also, I don't think libm transcendentals are known for being fast; I'm likely gaining additional performance by using a well-optimized but less accurate approximation. In fact, a little more inspection shows that my accuracy decreases as the input values increase; I will probably need to take a performance hit to fix this.
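A minimal sketch of why a fixed-degree polynomial approximation loses accuracy as the inputs grow (plain NumPy for illustration, not the SIMD/CorePy code being discussed): a truncated Taylor series for cos is excellent near zero but drifts badly for large arguments unless the argument is first reduced into [-pi, pi].

import numpy as np

def cos_poly(x):
    # Degree-8 Taylor polynomial for cos(x) about 0 -- no range reduction.
    x2 = x * x
    return 1 - x2/2 + x2**2/24 - x2**3/720 + x2**4/40320

for hi in (1.0, np.pi, 10.0, 100.0):
    x = np.linspace(-hi, hi, 10001)
    err = np.max(np.abs(cos_poly(x) - np.cos(x)))
    print("max |error| on [-%g, %g]: %.3g" % (hi, hi, err))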
I went and wrote code to use the libm fcos() routine instead of my cos code. Performance is equivalent to numpy, plus an overhead:
inp sizes       1024     10240    102400    1024000    3072000
numpy         0.7282    9.6278  115.5976   993.5738  3017.3680
lmcos 1       0.7594    9.7579  116.7135  1039.5783  3156.8371
lmcos 2       0.5274    5.7885   61.8052   537.8451  1576.2057
lmcos 4       0.5172    5.1240   40.5018   313.2487   791.9730
corepy 1      0.0142    0.0880    0.9566     9.6162    28.4972
corepy 2      0.0342    0.0754    0.6991     6.1647    15.3545
corepy 4      0.0596    0.0963    0.5671     4.9499    13.8784
The times I show are in milliseconds; the system used is a dual-socket, dual-core 2 GHz Opteron. I'm testing at the ufunc level, like this:
import time

def benchmark(fn, args):
    avgtime = 0
    fn(*args)    # prime the execution once before timing

    for i in xrange(7):
        t1 = time.time()
        fn(*args)
        t2 = time.time()

        tm = t2 - t1
        avgtime += tm

    return avgtime / 7
Where fn is a ufunc, e.g. numpy.cos. So I prime the execution once, then do 7 timings and take the average. I always appreciate suggestions on better ways to benchmark things.
No, that seems good enough. But maybe you can present results in cycles/item. This is a relatively common unit and has the advantage that it does not depend on the frequency of your cores.
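For example, converting the milliseconds in the table above to cycles/item is a one-liner, assuming the 2 GHz clock of the Opteron mentioned earlier (the function name is made up for illustration):

def cycles_per_item(time_ms, n_items, clock_hz=2e9):
    return (time_ms / 1e3) * clock_hz / n_items

# numpy's 3017.3680 ms for 3,072,000 elements:
print(cycles_per_item(3017.3680, 3072000))   # ~1964 cycles/item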
(It seems that I do not receive all emails - I never get the emails from Andrew?)

Concerning the timing: I think you should generally report the minimum, not the average. The numbers for numpy are strange: 3 s to compute 3e6 cos on a 2 GHz Opteron (~2000 cycles/item) is very slow. In that sense, the 20 cycles/item for your optimized version is much more believable, though :)

I know the usual libm functions are not super fast, especially if high accuracy is not needed. Music software and games usually get away with approximations which are quite fast (e.g. evaluating cos and sin at the same time), but those are generally unacceptable for scientific usage. I think it is critical to always check the result of your implementation, because getting something fast but wrong can waste a lot of your time :)

One thing which may be hard to do is correct nan/inf handling. I don't know how SIMD extensions handle this.

cheers,

David
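Reporting the minimum rather than the average needs only a small change to the benchmark function shown earlier; a sketch, with the loop count kept at 7 as in the original:

import time

def benchmark_min(fn, args, reps=7):
    # Warm up once, then report the best of several runs; the minimum is
    # less sensitive to interference from other processes than the mean.
    fn(*args)
    best = float('inf')
    for _ in range(reps):
        t1 = time.time()
        fn(*args)
        t2 = time.time()
        best = min(best, t2 - t1)
    return best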