[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

Sun Mar 23 07:03:51 EDT 2008

Travis E. Oliphant wrote:
> Anne Archibald wrote:
>   
>> On 22/03/2008, Travis E. Oliphant <oliphant at enthought.com> wrote:
>>   
>>     
>>> James Philbin wrote:
>>>  > Personally, I think that the time would be better spent optimizing
>>>  > routines for single-threaded code and relying on BLAS and LAPACK
>>>  > libraries to use multiple cores for more complex calculations. In
>>>  > particular, doing some basic loop unrolling and SSE versions of the
>>>  > ufuncs would be beneficial. I have some experience writing SSE code
>>>  > using intrinsics and would be happy to give it a shot if people tell
>>>  > me what functions I should focus on.
>>>
>>> Fabulous!   This is on my Project List of todo items for NumPy.  See
>>>  http://projects.scipy.org/scipy/numpy/wiki/ProjectIdeas I should spend
>>>  some time refactoring the ufunc loops so that the templating does not
>>>  get in the way of doing this on a case by case basis.
>>>
>>>  1) You should focus on the math operations:  add, subtract, multiply,
>>>  divide, and so forth.
>>>  2) Then for "combined operations" we should expose the functionality at
>>>  a high-level.  So, that somebody could write code to take advantage of it.
>>>
>>>  It would be easiest to use intrinsics which would then work for AMD,
>>>  Intel, on multiple compilers.
>>>     
>>>       
>> I think even heavier use of code generation would be a good idea here.
>> There are so many different versions of each loop, and the fastest way
>> to run each one is going to be different for different versions and
>> different platforms, that a routine that assembled the code from
>> chunks and picked the fastest combination for each instance might make
>> a big difference - this is roughly what FFTW and ATLAS do.
>>
>> There are also some optimizations to be made at a higher level that
>> might give these optimizations more traction. For example:
>>
>> A = randn(100*100)
>> A.shape = (100,100)
>> A*A
>>
>> There's no reason the multiply ufunc couldn't flatten A and use a
>> single unstrided loop to do the multiplication.
>>   
>>     
> Good idea,  it does already do that :-)  The ufunc machinery is also a 
> good place for an optional thread pool.
>
> Perhaps we could drum up interest in a Need for Speed Sprint on NumPy 
> sometime over the next few months.
>
>
> -Travis O.
>   

Hi,

I have a very limited knowledge of openmp but please consider this 
testcase :

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

#define N 100000000

int
main(void)
{
  double *data;
  data = malloc(N*sizeof(double));
  long i;
  #pragma omp parallel for
  for(i=0;i<N;i++)
    {
      data[i]=i;
    }

  long j;

  for(j=0;j<4;j++)
    {
      #pragma omp parallel for
      for(i=0;i<N;i++)
        {
          data[i]=cos(data[i]);
        }
    }
  return 0;
}

gcc -fopenmp -Wall -lm -O3  sin.c -o sinopenmp and gcc -fopenmp -Wall 
-lm -O3  sin.c -o sin

On my core2duo :

time ./sin

real    0m15.910s
user    0m15.249s
sys     0m0.646s

and

time ./sinopenmp

real    0m8.699s
user    0m16.287s
sys     0m0.893s

It scales very well :) (gcc-4.2). It would be so nice to see that usign 
numpy.sin(a)
Ok it is a very simple case but numpy.sin(a) is such a case isn't it ??

Please give it a try ;)

Xavier