[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

Travis E. Oliphant oliphant at enthought.com
Sat Mar 22 15:16:31 EDT 2008


Anne Archibald wrote:
> On 22/03/2008, Travis E. Oliphant <oliphant at enthought.com> wrote:
>   
>> James Philbin wrote:
>>  > Personally, I think that the time would be better spent optimizing
>>  > routines for single-threaded code and relying on BLAS and LAPACK
>>  > libraries to use multiple cores for more complex calculations. In
>>  > particular, some basic loop unrolling and SSE versions of the
>>  > ufuncs would be beneficial. I have some experience writing SSE code
>>  > using intrinsics and would be happy to give it a shot if people tell
>>  > me what functions I should focus on.
>>
>> Fabulous!  This is on my Project List of todo items for NumPy; see
>>  http://projects.scipy.org/scipy/numpy/wiki/ProjectIdeas.  I should spend
>>  some time refactoring the ufunc loops so that the templating does not
>>  get in the way of doing this on a case-by-case basis.
>>
>>  1) You should focus on the math operations: add, subtract, multiply,
>>  divide, and so forth.
>>  2) Then, for "combined operations," we should expose the functionality at
>>  a high level so that somebody could write code to take advantage of it.
>>
>>  It would be easiest to use intrinsics, which would then work on both AMD
>>  and Intel hardware, across multiple compilers.
>>     
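
For concreteness, a rough sketch of what an intrinsics-based inner loop
could look like for a contiguous single-precision add (the function name
and signature here are hypothetical - the real ufunc loops take char
pointers and byte strides, so this only shows the contiguous fast path):

#include <emmintrin.h>  /* SSE/SSE2 intrinsics; GCC, MSVC, and ICC all support these */

/* Hypothetical contiguous loop: out[i] = a[i] + b[i].
   Unaligned loads/stores avoid any alignment bookkeeping. */
static void
add_float_sse(const float *a, const float *b, float *out, long n)
{
    long i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats at once */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)                     /* scalar tail */
        out[i] = a[i] + b[i];
}
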
>
> I think even heavier use of code generation would be a good idea here.
> There are so many different versions of each loop, and the fastest way
> to run each one varies so much across versions and platforms, that a
> routine which assembled the code from chunks and picked the fastest
> combination for each instance might make a big difference - this is
> roughly what FFTW and ATLAS do.
>
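
As a toy illustration of that kind of empirical selection (everything
below is made up for the example - FFTW and ATLAS are vastly more
sophisticated about how they search and time):

#include <time.h>

typedef void (*loop_t)(const float *, const float *, float *, long);

/* Two candidate implementations of the same loop. */
static void add_plain(const float *a, const float *b, float *o, long n)
{
    for (long i = 0; i < n; i++)
        o[i] = a[i] + b[i];
}

static void add_unrolled(const float *a, const float *b, float *o, long n)
{
    long i = 0;
    for (; i + 4 <= n; i += 4) {   /* unroll by 4 */
        o[i]   = a[i]   + b[i];
        o[i+1] = a[i+1] + b[i+1];
        o[i+2] = a[i+2] + b[i+2];
        o[i+3] = a[i+3] + b[i+3];
    }
    for (; i < n; i++)
        o[i] = a[i] + b[i];
}

/* Time each candidate once on a sample buffer and return the winner.
   A real system would cache the choice per platform and use a much
   better timer than clock(). */
static loop_t pick_fastest(float *a, float *b, float *o, long n)
{
    loop_t candidates[] = { add_plain, add_unrolled };
    loop_t best = candidates[0];
    double best_t = 1e30;
    for (int k = 0; k < 2; k++) {
        clock_t t0 = clock();
        candidates[k](a, b, o, n);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_t) { best_t = t; best = candidates[k]; }
    }
    return best;
}
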
> There are also some optimizations to be made at a higher level that
> might give these low-level improvements more traction. For example:
>
> A = randn(100*100)
> A.shape = (100,100)
> A*A
>
> There's no reason the multiply ufunc couldn't flatten A and use a
> single unstrided loop to do the multiplication.
>   
Good idea - it already does that :-)  The ufunc machinery is also a
good place for an optional thread pool.
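
A rough sketch of the idea, with a hypothetical binary-ufunc inner loop
(again, the real loops operate on char pointers with byte strides): when
all the strides are unit, the strided problem collapses to one flat loop,
which is also the easy case to hand to OpenMP or a thread pool:

/* Hypothetical inner loop for a binary double ufunc. */
static void
multiply_double(const double *a, long astride,
                const double *b, long bstride,
                double *out, long ostride, long n)
{
    if (astride == 1 && bstride == 1 && ostride == 1) {
        /* Contiguous fast path.  Only thread large problems, since
           thread startup costs more than a small loop does. */
        #pragma omp parallel for if (n > 100000)
        for (long i = 0; i < n; i++)
            out[i] = a[i] * b[i];
    }
    else {
        /* General strided path. */
        for (long i = 0; i < n; i++)
            out[i * ostride] = a[i * astride] * b[i * bstride];
    }
}
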

Perhaps we could drum up interest in a Need for Speed Sprint on NumPy 
sometime over the next few months.


-Travis O.



