[Numpy-discussion] numpy ufuncs and COREPY - any info?
Francesc Alted
faltet at pytables.org
Tue May 26 02:56:25 EDT 2009
On Tuesday 26 May 2009 03:11:56 David Cournapeau wrote:
> Charles R Harris wrote:
> > On Mon, May 25, 2009 at 4:59 AM, Andrew Friedley <afriedle at indiana.edu> wrote:
> >
> > For some reason the list seems to occasionally drop my messages...
> >
> > Francesc Alted wrote:
> > > On Friday 22 May 2009 13:52:46 Andrew Friedley wrote:
> > >> I'm the student doing the project. I have a blog here, which contains
> > >> some initial performance numbers for a couple test ufuncs I did:
> > >>
> > >> http://numcorepy.blogspot.com
> > >>
> > >> Another alternative we've talked about, and I (more and more likely) may
> > >> look into is composing multiple operations together into a single ufunc.
> > >> Again the main idea being that memory accesses can be reduced/eliminated.
> > >
> > > IMHO, composing multiple operations together is the most promising avenue for
> > > leveraging current multicore systems.
> >
> > Agreed -- our concern when considering this for the project was to keep
> > the scope reasonable so I can complete it in the GSoC timeframe. If I
> > have time I'll definitely be looking into this over the summer; if not,
> > later.
> >
> > > Another interesting approach is to implement costly operations (from the point
> > > of view of CPU resources), namely transcendental functions like sin, cos or
> > > tan, but also others like sqrt or pow, in a parallel way. If, besides, you can
> > > combine this with vectorized versions of them (by using the widespread SSE2
> > > instruction set, see [1] for an example), then you would be able to achieve
> > > really good results for sure (at least Intel did with its VML library ;)
> > >
> > > [1] http://gruntthepeon.free.fr/ssemath/
> >
> > I've seen that page before. Using another source [1] I came up with
> > a quick/dirty cos ufunc. Performance is crazy good compared to NumPy
> > (100x); see the latest post on my blog for a little more info. I'll look
> > at the source myself when I get time again, but is NumPy using a
> > Python-based cos function, a C implementation, or something else? As I
> > wrote in my blog, the performance gain is almost too good to believe.
> >
> >
> > Numpy uses the C library version. If long double and float aren't
> > available the double version is used with number conversions, but that
> > shouldn't give a factor of 100x. Something else is going on.
>
> I think something is wrong with the measurement method - on my machine,
> computing the cos of an array of double takes roughly ~400 cycles/item
> for arrays with a reasonable size (> 1e3 items). Taking 4 cycles/item
> for cos would be very impressive :)
Well, it is Andrew who should demonstrate that his measurement is correct, but
in principle, 4 cycles/item *should* be feasible when using 8 cores in
parallel. In [1] one can see how Intel manages (with its VML kernel) to
compute a cos() in less than 23 cycles on a single core. With 8 cores in
parallel, that would allow, in theory, reaching about 3 cycles/item
(23 / 8 is roughly 2.9).

[1] http://www.intel.com/software/products/mkl/data/vml/functions/_performanceall.html
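
The arithmetic behind that estimate can be sketched in a few lines of Python. This is only a back-of-the-envelope model that assumes perfect linear scaling across cores; the function names, the example timing and the 3 GHz clock rate are assumptions made for illustration and do not come from the thread or from any library.

```python
# Back-of-the-envelope scaling estimate. Function names and the example
# timing/clock values are illustrative assumptions; only the 23 cycles/item
# and 8 cores figures are taken from the message.

def ideal_cycles_per_item(single_core_cycles, n_cores):
    """Effective cycles/item assuming perfect linear scaling across cores."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    return single_core_cycles / n_cores


def cycles_per_item(elapsed_seconds, n_items, clock_hz):
    """Convert a wall-clock timing into cycles/item for a given clock rate."""
    return elapsed_seconds * clock_hz / n_items


if __name__ == "__main__":
    # 23 cycles/item on one core, spread over 8 cores.
    print(ideal_cycles_per_item(23, 8))  # 2.875
    # Hypothetical timing: 1e6 items in 1.5 ms on an assumed 3 GHz core.
    print(cycles_per_item(1.5e-3, 1_000_000, 3e9))
```

Under the same idealized model, the roughly 400 cycles/item quoted above for the C library cos would only drop to 50 cycles/item on 8 cores, so a figure near 4 cycles/item would need a vectorized kernel as well as parallelism.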
--
Francesc Alted