
On Tuesday 26 May 2009 03:11:56 David Cournapeau wrote:
Charles R Harris wrote:
On Mon, May 25, 2009 at 4:59 AM, Andrew Friedley <afriedle@indiana.edu> wrote:
For some reason the list seems to occasionally drop my messages...
Francesc Alted wrote:
> On Friday 22 May 2009 13:52:46 Andrew Friedley wrote:
>> I'm the student doing the project. I have a blog here, which contains
>> some initial performance numbers for a couple of test ufuncs I did:
>>
>> http://numcorepy.blogspot.com
>>
>> Another alternative we've talked about, and that I (more and more
>> likely) may look into, is composing multiple operations together into
>> a single ufunc. Again, the main idea is that memory accesses can be
>> reduced/eliminated.
>
> IMHO, composing multiple operations together is the most promising
> avenue for leveraging current multicore systems.
Agreed -- our concern when scoping the project was to keep it reasonable so that I can complete it within the GSoC timeframe. If I have time I'll definitely look into this over the summer; if not, later.
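
For illustration, a minimal sketch of this composition idea from the Python side, assuming the numexpr package is available (numexpr evaluates a whole expression in one blocked pass, instead of one full memory pass per ufunc):

    import numpy as np
    import numexpr as ne  # assumed installed

    x = np.random.rand(1000000)

    # Plain NumPy: three separate ufunc passes over the data, each
    # allocating a full-size temporary array.
    y1 = np.sin(x) * np.cos(x) + 1.0

    # numexpr: the whole expression runs in a single blocked loop,
    # so intermediates stay in cache-sized chunks.
    y2 = ne.evaluate("sin(x) * cos(x) + 1.0")

    assert np.allclose(y1, y2)

Any win here comes entirely from streaming the data through main memory once instead of several times.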
> Another interesting approach is to implement costly operations (from
> the point of view of CPU resources), namely transcendental functions
> like sin, cos or tan, but also others like sqrt or pow, in a parallel
> way. If, besides, you can combine this with vectorized versions of
> them (by using the widespread SSE2 instruction set; see [1] for an
> example), then you would be able to achieve really good results for
> sure (at least Intel did with its VML library ;)
>
> [1] http://gruntthepeon.free.fr/ssemath/
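
As an aside: one way to reach such vectorized transcendental kernels from Python, without hand-writing SSE intrinsics, is a numexpr build linked against Intel's VML. A minimal sketch, assuming such a build is installed:

    import numpy as np
    import numexpr as ne

    # True only when this numexpr build was linked against Intel's VML.
    print(ne.use_vml)

    x = np.random.rand(1000000)
    # With a VML-enabled build, cos() dispatches to VML's vectorized
    # kernel instead of looping over the scalar C library cos.
    y = ne.evaluate("cos(x)")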
I've seen that ssemath page before. Using another source [1] I came up with a quick-and-dirty cos ufunc. Performance is crazy good compared to NumPy (100x); see the latest post on my blog for a little more info. I'll look at the source myself when I get time again, but is NumPy using a Python-based cos function, a C implementation, or something else? As I wrote on my blog, the performance gain is almost too good to believe.
NumPy uses the C library version. If the long double and float versions aren't available, the double version is used with type conversions, but that shouldn't give a factor of 100x. Something else is going on.
I think something is wrong with the measurement method -- on my machine, computing the cos of an array of doubles takes roughly 400 cycles/item for arrays of reasonable size (> 1e3 items). Getting 4 cycles/item for cos would be very impressive :)
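
A minimal sketch of one way to estimate cycles/item, assuming a placeholder 3 GHz clock (substitute your CPU's actual frequency):

    import time
    import numpy as np

    n = 1000000        # big enough that per-call overhead is negligible
    reps = 20
    x = np.random.rand(n)

    t0 = time.time()
    for _ in range(reps):
        np.cos(x)
    t1 = time.time()

    cpu_hz = 3.0e9     # assumed 3 GHz clock; adjust for your machine
    print("~%.0f cycles/item" % ((t1 - t0) / reps / n * cpu_hz))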
Well, it is Andrew who should demonstrate that his measurement is correct, but in principle 4 cycles/item *should* be feasible when using 8 cores in parallel. In [1] one can see how Intel manages (with its VML kernel) to compute a cos() in less than 23 cycles on a single core. Having 8 cores working in parallel would, in theory, bring that down to about 3 cycles/item.

[1] http://www.intel.com/software/products/mkl/data/vml/functions/_performanceal...

-- Francesc Alted
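
A minimal sketch of that multicore arithmetic in practice, assuming the standard-library multiprocessing module and 8 available cores (the inter-process copying overhead means the per-chunk work must dominate for this to pay off):

    import numpy as np
    from multiprocessing import Pool

    def chunk_cos(chunk):
        # Each worker computes cos over its own slice of the array.
        return np.cos(chunk)

    if __name__ == "__main__":
        x = np.random.rand(8000000)
        pool = Pool(processes=8)  # one worker per core
        parts = pool.map(chunk_cos, np.array_split(x, 8))
        pool.close()
        pool.join()
        y = np.concatenate(parts)
        assert np.allclose(y, np.cos(x))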