
Francesc Alted wrote:
Well, it is Andrew who should demonstrate that his measurement is correct, but in principle, 4 cycles/item *should* be feasible when using 8 cores in parallel.
But the 100x speed increase is for one core only, unless I misread the table. And I should have mentioned that 400 cycles/item for cos is on a Pentium 4, which has dreadful performance (a defective L1). On a much better Core 2 Duo Extreme or some such, I get 100 cycles/item (on a 64-bit machine, though, and not the same compiler, although I guess the libm version is what matters most here).

And let's not forget the Python wrapping cost: by doing everything in C, I got ~200 cycles/cos on the Pentium 4 and ~60 cycles/cos on the Core 2 Duo (for doubles), using the rdtsc performance counter. All this is for 1024 items in the array, so a very optimistic use case (everything fits in L2 cache, if not L1). This shows that the Python wrapping cost is not that high, which makes the 100x claim a bit doubtful without more details on how the speed was measured.

cheers,

David
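
PS: for reference, here is a minimal sketch of the kind of rdtsc timing loop I mean. The array size, iteration count, and rdtsc helper are illustrative choices, not the exact code I ran:

/* Time cos() over a 1024-item double array with rdtsc and report
 * cycles per item. Warming up first keeps everything in cache, so
 * this is the optimistic case discussed above. */
#include <stdio.h>
#include <stdint.h>
#include <math.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

#define N 1024
#define NITER 1000

int main(void)
{
    static double in[N], out[N];
    for (int i = 0; i < N; ++i)
        in[i] = 0.001 * i;

    /* Warm-up pass so the arrays sit in L1/L2 before timing. */
    for (int i = 0; i < N; ++i)
        out[i] = cos(in[i]);

    uint64_t start = rdtsc();
    for (int iter = 0; iter < NITER; ++iter)
        for (int i = 0; i < N; ++i)
            out[i] = cos(in[i]);
    uint64_t end = rdtsc();

    /* Print a checksum too, so the compiler cannot drop the loop. */
    printf("%.1f cycles/item (checksum %g)\n",
           (double)(end - start) / ((double)NITER * N),
           out[0] + out[N - 1]);
    return 0;
}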