[Numpy-discussion] numpy ufuncs and COREPY - any info?

David Cournapeau david at ar.media.kyoto-u.ac.jp
Thu May 28 03:02:12 EDT 2009


Francesc Alted wrote:
> On Tuesday 26 May 2009 15:14:39, Andrew Friedley wrote:
>   
>> David Cournapeau wrote:
>>     
>>> Francesc Alted wrote:
>>>       
>>>> Well, it is Andrew who should demonstrate that his measurement is
>>>> correct, but in principle, 4 cycles/item *should* be feasible when using
>>>> 8 cores in parallel.
>>>>         
>>> But the 100x speed increase is for one core only, unless I misread the
>>> table. And I should have mentioned that the 400 cycles/item for cos was
>>> measured on a Pentium 4, which has dreadful performance (a deficient
>>> L1 cache). On a much better Core 2 Duo Extreme, I get 100 cycles/item
>>> (on a 64-bit machine, though, and not the same compiler, although I
>>> guess the libm version is what matters most here).
>>>
>>> And let's not forget the Python wrapping cost: by doing everything in
>>> C, I got ~200 cycles/cos on the P4, and ~60 cycles/cos on the Core 2
>>> Duo (for doubles), using the rdtsc performance counter. All this for
>>> 1024 items in the array, so a very optimistic use case (everything
>>> fits in the L2 cache, if not L1).
>>>
>>> This shows that the Python wrapping cost is not that high, making the
>>> 100x claim a bit doubtful without more details on how the speed was
>>> measured.
>>>       
>> I appreciate all the discussion this is creating.  I wish I could work
>> on this more right now; I have a big paper deadline coming up June 1
>> that I need to focus on.
>>
>> Yes, you're reading the table right.  I should have been clearer about
>> what my implementation is doing.  It's using SIMD, so it performs 4
>> cosines at a time where a libm cosine does only one.  Also, I don't
>> think libm transcendentals are known for being fast; I'm also likely
>> gaining performance by using a well-optimized but less accurate
>> approximation.  In fact, a little more inspection shows that my accuracy
>> decreases as the input values increase; I will probably need to take a
>> performance hit to fix this.
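>>
>> To illustrate (a rough Python sketch, not the actual CorePy kernel:
>> cos_poly is a hypothetical truncated-polynomial approximation with no
>> range reduction, which is the kind of thing that loses accuracy as the
>> inputs grow):
>>
>> import numpy as np
>>
>> def cos_poly(x):
>>     # truncated Taylor series for cos; only accurate near zero
>>     x2 = x * x
>>     return 1 - x2/2 + x2*x2/24 - x2*x2*x2/720
>>
>> for hi in (1.0, 4.0, 16.0):
>>     xs = np.linspace(0.0, hi, 10001)
>>     err = np.max(np.abs(cos_poly(xs) - np.cos(xs)))
>>     print("max error on [0, %g]: %g" % (hi, err))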
>>
>> I went and wrote code to use the libm fcos() routine instead of my cos
>> code.  Performance is equivalent to numpy's, plus some overhead:
>>
>> input sizes    1024    10240   102400  1024000  3072000
>> numpy        0.7282   9.6278 115.5976  993.5738 3017.3680
>>
>> lmcos    1   0.7594   9.7579 116.7135 1039.5783 3156.8371
>> lmcos    2   0.5274   5.7885  61.8052  537.8451 1576.2057
>> lmcos    4   0.5172   5.1240  40.5018  313.2487  791.9730
>>
>> corepy   1   0.0142   0.0880   0.9566    9.6162   28.4972
>> corepy   2   0.0342   0.0754   0.6991    6.1647   15.3545
>> corepy   4   0.0596   0.0963   0.5671    4.9499   13.8784
>>
>>
>> The times shown are in milliseconds; the system used is a dual-socket,
>> dual-core 2 GHz Opteron.  I'm testing at the ufunc level, like this:
>>
>> import time
>>
>> def benchmark(fn, args):
>>     avgtime = 0.0
>>     fn(*args)    # prime the execution once before timing
>>
>>     for i in xrange(7):
>>         t1 = time.time()
>>         fn(*args)
>>         t2 = time.time()
>>         avgtime += t2 - t1
>>
>>     return avgtime / 7
>>
>> Here fn is a ufunc, e.g. numpy.cos.  So I prime the execution once, then
>> do 7 timings and take the average.  I always appreciate suggestions on
>> better ways to benchmark things.
>>     
>
> No, that seems good enough.  But maybe you can present results in cycles/item.  
> This is a relatively common unit and has the advantage that it does not depend 
> on the frequency of your cores.
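>
> For instance (a Python sketch; the 2 GHz clock and the sample timings
> are the ones from the table above):
>
> freq_hz = 2.0e9   # the 2 GHz Opteron described above
>
> def cycles_per_item(ms, nitems):
>     # wall-clock milliseconds for nitems items -> CPU cycles per item
>     return (ms * 1e-3) * freq_hz / nitems
>
> print("%.0f" % cycles_per_item(3017.3680, 3072000))  # numpy:    ~1964
> print("%.0f" % cycles_per_item(28.4972, 3072000))    # corepy 1:   ~19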
>   

(It seems that I do not receive all the emails; I never get the ones
from Andrew?)

Concerning the timing: I think you should generally report the minimum,
not the average. The numbers for numpy are strange: 3 s to compute 3e6
cos on a 2 GHz Opteron (~2000 cycles/item) is very slow. In that sense,
the ~20 cycles/item of your optimized version is much more believable,
though :)
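
Something like this minimal variant of your benchmark would report the
minimum instead (a sketch; benchmark_min is just an illustrative name):

import time

def benchmark_min(fn, args, repeats=7):
    fn(*args)                  # prime once, as in the original
    times = []
    for i in xrange(repeats):
        t1 = time.time()
        fn(*args)
        times.append(time.time() - t1)
    return min(times)          # the minimum is least disturbed by the OS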

I know the usual libm functions are not super fast, especially compared
to what is possible when high accuracy is not needed. Music software and
games usually get away with approximations which are quite fast (e.g.
evaluating cos and sin at the same time), but those are generally
unacceptable for scientific usage. I think it is critical to always
check the results of your implementation, because getting something fast
but wrong can waste a lot of your time :) One thing which may be hard to
get right is correct nan/inf handling; I don't know how SIMD extensions
handle this.
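
A quick check could look like this (a sketch; fast_cos is only a
placeholder for the implementation under test):

import numpy as np

fast_cos = np.cos   # placeholder; substitute the SIMD version here

x = np.array([0.0, 0.5, 1e6, np.inf, -np.inf, np.nan])
ref = np.cos(x)     # reference values: cos(+-inf) and cos(nan) are nan
out = fast_cos(x)

# each element must agree within a tolerance, or both be nan
ok = (np.abs(out - ref) < 1e-6) | (np.isnan(out) & np.isnan(ref))
print(ok)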

cheers,

David


