[Numpy-discussion] numpy ufuncs and COREPY - any info?

Tue May 26 02:56:25 EDT 2009

On Tuesday 26 May 2009 03:11:56 David Cournapeau wrote:
> Charles R Harris wrote:
> > On Mon, May 25, 2009 at 4:59 AM, Andrew Friedley
> > <afriedle at indiana.edu> wrote:
> >
> >     For some reason the list seems to occasionally drop my messages...
> >
> >     Francesc Alted wrote:
> >     > On Friday 22 May 2009 13:52:46 Andrew Friedley wrote:
> >     >> I'm the student doing the project.  I have a blog here, which contains
> >     >> some initial performance numbers for a couple test ufuncs I did:
> >     >>
> >     >> http://numcorepy.blogspot.com
> >     >>
> >     >> Another alternative we've talked about, and I (more and more likely) may
> >     >> look into is composing multiple operations together into a single ufunc.
> >     >> Again the main idea being that memory accesses can be reduced/eliminated.
> >
> >     > IMHO, composing multiple operations together is the most promising
> >     > avenue for leveraging current multicore systems.
> >
> >     Agreed -- our concern when considering it for the project was to keep
> >     the scope reasonable so I can complete it in the GSoC timeframe.  If I
> >     have time I'll definitely be looking into this over the summer; if not,
> >     later.
> >
> >     > Another interesting approach is to implement costly operations (from
> >     > the point of view of CPU resources) in a parallel way, namely
> >     > transcendental functions like sin, cos or tan, but also others like
> >     > sqrt or pow.  If, besides, you can combine this with vectorized
> >     > versions of them (by using the widespread SSE2 instruction set, see
> >     > [1] for an example), then you would be able to achieve really good
> >     > results for sure (at least Intel did with its VML library ;)
> >     >
> >     > [1] http://gruntthepeon.free.fr/ssemath/
> >
> >     I've seen that page before.  Using another source [1] I came up with
> >     a quick/dirty cos ufunc.  Performance is crazy good compared to NumPy
> >     (100x); see the latest post on my blog for a little more info.  I'll
> >     look at the source myself when I get time again, but is NumPy using a
> >     Python-based cos function, a C implementation, or something else?  As I
> >     wrote in my blog, the performance gain is almost too good to believe.
> >
> >
> > Numpy uses the C library version. If long double and float aren't
> > available the double version is used with number conversions, but that
> > shouldn't give a factor of 100x. Something else is going on.
>
> I think something is wrong with the measurement method - on my machine,
> computing the cos of an array of double takes roughly ~400 cycles/item
> for arrays with a reasonable size (> 1e3 items). Taking 4 cycles/item
> for cos would be very impressive :)

Well, it is Andrew who should demonstrate that his measurement is correct, but
in principle, 4 cycles/item *should* be feasible when using 8 cores in
parallel.  In [1] one can see how Intel manages (with its VML kernel) to
compute a cos() in less than 23 cycles on a single core.  With 8 cores in
parallel that would, in theory, allow reaching about 3 cycles/item
(23 / 8 is roughly 2.9).

[1] http://www.intel.com/software/products/mkl/data/vml/functions/_performanceall.html
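
For anyone who wants to redo the arithmetic, here is a minimal sketch.  The
function names and the 2 GHz clock in the usage lines are only illustrative
assumptions (nothing here comes from NumPy or VML), and the scaling is the
ideal case: it ignores memory bandwidth limits and threading overhead, so
real numbers will be worse.

```python
def ideal_cycles_per_item(single_core_cycles, n_cores):
    """Best-case cycles/item if the work splits perfectly across n_cores.

    Assumes perfect linear scaling, which real hardware rarely delivers.
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return single_core_cycles / n_cores


def cycles_per_item(elapsed_seconds, n_items, clock_hz):
    """Convert a wall-clock timing into cycles spent per array item.

    clock_hz is the CPU frequency you assume for the machine; frequency
    scaling makes this an approximation.
    """
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return elapsed_seconds * clock_hz / n_items


# The two single-core figures quoted in this thread, spread over 8 cores.
print(ideal_cycles_per_item(23, 8))    # VML-like cost per item
print(ideal_cycles_per_item(400, 8))   # libm-like cost per item

# Hypothetical timing: 1 ms for 10,000 items on an assumed 2 GHz core.
print(cycles_per_item(1e-3, 10_000, 2.0e9))
```

The second function is the check that matters for a surprising benchmark:
turn the measured time into cycles/item and see whether the result is
physically plausible for the number of cores actually in use.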

-- 
Francesc Alted