[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

Sat Mar 22 15:54:10 EDT 2008

On Sat, Mar 22, 2008 at 1:16 PM, Travis E. Oliphant <oliphant at enthought.com>
wrote:

> Anne Archibald wrote:
> > On 22/03/2008, Travis E. Oliphant <oliphant at enthought.com> wrote:
> >
> >> James Philbin wrote:
> >>  > Personally, I think that the time would be better spent optimizing
> >>  > routines for single-threaded code and relying on BLAS and LAPACK
> >>  > libraries to use multiple cores for more complex calculations. In
> >>  > particular, doing some basic loop unrolling and SSE versions of the
> >>  > ufuncs would be beneficial. I have some experience writing SSE code
> >>  > using intrinsics and would be happy to give it a shot if people tell
> >>  > me what functions I should focus on.
> >>
> >> Fabulous!   This is on my Project List of todo items for NumPy.  See
> >>  http://projects.scipy.org/scipy/numpy/wiki/ProjectIdeas I should spend
> >>  some time refactoring the ufunc loops so that the templating does not
> >>  get in the way of doing this on a case by case basis.
> >>
> >>  1) You should focus on the math operations:  add, subtract, multiply,
> >>  divide, and so forth.
> >>  2) Then for "combined operations" we should expose the functionality
> >>  at a high-level.  So, that somebody could write code to take advantage
> >>  of it.
> >>
> >>  It would be easiest to use intrinsics which would then work for AMD,
> >>  Intel, on multiple compilers.
> >>
> >
> > I think even heavier use of code generation would be a good idea here.
> > There are so many different versions of each loop, and the fastest way
> > to run each one is going to be different for different versions and
> > different platforms, that a routine that assembled the code from
> > chunks and picked the fastest combination for each instance might make
> > a big difference - this is roughly what FFTW and ATLAS do.
> >
> > There are also some optimizations to be made at a higher level that
> > might give these optimizations more traction. For example:
> >
> > A = randn(100*100)
> > A.shape = (100,100)
> > A*A
> >
> > There's no reason the multiply ufunc couldn't flatten A and use a
> > single unstrided loop to do the multiplication.
> >
> Good idea,  it does already do that :-)  The ufunc machinery is also a
> good place for an optional thread pool.
>
> Perhaps we could drum up interest in a Need for Speed Sprint on NumPy
> sometime over the next few months.
>
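
To make the flattening point quoted above concrete, here is a minimal Python-level sketch. It only illustrates why a C-contiguous 2-D array can be treated as one flat run of memory; it does not show what the ufunc machinery does internally in C. The slice `A[:, ::2]` is just an example I picked of an array that cannot be flattened without a copy.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal(100 * 100)
A.shape = (100, 100)

# A C-contiguous array can be viewed as 1-D without copying any data.
flat = A.ravel()
print("contiguous:", A.flags.c_contiguous,
      "| ravel shares memory:", np.shares_memory(flat, A))

# Multiplying the flat view gives the same values as the 2-D multiply,
# which is why a single unstrided loop is enough in this case.
product_2d = A * A
product_flat = (flat * flat).reshape(A.shape)
print("same result:", np.array_equal(product_2d, product_flat))

# A strided slice is not contiguous, so flattening it forces a copy.
sliced = A[:, ::2]
print("slice contiguous:", sliced.flags.c_contiguous,
      "| ravel shares memory:", np.shares_memory(sliced.ravel(), sliced))
```

The last case is the one where a flat loop is not available and the strides actually have to be followed.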

I tend to think the first thing to do is to put together a small test
package, say with the double-precision loops and some standard array data,
and then time and profile different approaches so we don't spend a lot of
time and effort on something with little payoff. Since the most immediate
gains might come from attention to the cache, we might also look at some
compound operators, say multiply and add. Implementing mixed-type loops
might also save memory. So there are lots of things to look at.
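
As a rough idea of what such a test package could start from, here is a small timing sketch in Python. Everything in it is an assumption for illustration: the array size, the helper name `time_call`, and the choice of cases (contiguous versus strided input, and multiply-then-add with and without temporaries). It times the existing ufuncs from the Python side only; it is not a benchmark of any proposed SSE or threaded loop.

```python
import timeit
import numpy as np


def time_call(func, repeat=5, number=20):
    """Return the best observed time per call, in seconds (hypothetical helper)."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


rng = np.random.default_rng(0)
n = 1_000_000  # assumed size: large enough that the arrays do not fit in L1/L2
a = rng.standard_normal(n)
b = rng.standard_normal(n)
c = rng.standard_normal(n)
strided = rng.standard_normal(2 * n)[::2]  # same length, stride of two doubles
out = np.empty_like(a)


def separate():
    # Two ufunc calls; each one allocates a fresh result array.
    return a * b + c


def reuse_buffer():
    # Same arithmetic, but both passes write into one preallocated buffer.
    np.multiply(a, b, out=out)
    np.add(out, c, out=out)
    return out


results = {
    "add, contiguous": time_call(lambda: np.add(a, b, out=out)),
    "add, strided input": time_call(lambda: np.add(strided, b, out=out)),
    "multiply-add, temporaries": time_call(separate),
    "multiply-add, reused buffer": time_call(reuse_buffer),
}

for name, seconds in results.items():
    print(f"{name:28s} {seconds * 1e3:8.3f} ms")
```

The numbers will vary a lot by machine, which is rather the point: profiling cases like these first would show where the payoff actually is.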

Chuck