[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)

Gnata Xavier xavier.gnata at gmail.com
Sun Mar 23 16:45:37 EDT 2008


David Cournapeau wrote:
> Gnata Xavier wrote:
>   
>> Well of course my goal was not to say that my simple testcase can be 
>> copied/pasted into numpy :)
>> Of course it is one of the best cases for OpenMP.
>> Of course the pragmas can be more complex than that (you can tell it 
>> which variables can/cannot be shared, for instance).
>>
>> The size: using OpenMP will be slower on small arrays, that is clear, 
>> but a user doing very large computations is smart enough to know when 
>> he needs to split his job into threads. The obvious solution is to 
>> provide the user with parallel and non-parallel functions.
>>     
>
> IMHO, that's a really bad solution. It should be dynamically enabled 
> (like in matlab, if I remember correctly). And this means having a 
> plug-in subsystem to load/unload different implementations... that is 
> one of the things I was interested in getting done for numpy 1.1 (or 
> above).
>
>   
I fully agree. It was only a quick and dirty way to keep things simple.

> For small arrays: how much slower ? Does it make the code slower than 
> without open mp ? For example, what does your code says when N is 10, 
> 100, 1000 ?
>
>   
See the code and the results at the end of this post.

>> sse : sse can help a lot but multithreading just scales where sse 
>> mono-thread based solutions don't.
>>     
>
> It depends: it scales pretty well if you use several processes, and if 
> you can design your application in a multi-process way.
>
>   
ok.

>> Build/link : It is an issue. It has to be tested. I do not know because 
>> I haven't even tried.
>>
>> So, IMHO it would be nice to try to put some simple OpenMP pragmas 
>> into numpy *only to see what is going on*.
>>
>> Even if it only works with gcc, or even if... I do not know... It 
>> would be a first step. Step by step :)
>>     
>
> I agree about the step by step approach; I am just not sure I agree with 
> your steps, that's all. Personally, I would first try getting a plug-in 
> system working with numpy.
I do agree with that :). Sorry, I should have put that clearly before: 
I do agree.


>  But really, prove me wrong. Do it, try 
> putting some pragmas in some places in the ufunc machinery or somewhere 
> else; as I said earlier, I would be happy to add support for OpenMP at 
> the build level, at least in numscons. I would love being proven wrong 
> and having a numpy which scales well with multi-core :)
>
>   
Ok, I will try to see what I can do, but it is sure that we do need the 
plug-in system first (read "before the threads in the numpy release"). 
During the devel of 1.1, I will try to find some time to understand 
where I should put some pragmas into the ufunc machinery using a very 
conservative approach. Anyone with OpenMP knowledge is welcome, because 
I'm not an OpenMP expert but only an OpenMP user in my C/C++ codes.

Here is the code (I hope it is correct):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <sys/time.h>
#include <unistd.h>

double
accurate_time()
{
  struct timeval t;
  gettimeofday(&t,NULL);
  return (double)t.tv_sec + t.tv_usec*0.000001;
}

int
main(void)
{
  long i, j, k;
  double t1, t2;
  double *dataP;
  double *data;
  long Size = 100000000;  /* 1e8; avoid implicit double-to-long conversion */
  long NLoops = 40;

  for(k=1; k<8; k++)
    {
      Size /= 10;
      NLoops *= 2;

      dataP = malloc(Size*sizeof(double));
      t1 = accurate_time();

#pragma omp parallel for
      for(i=0; i< Size; i++)
        {
          dataP[i]=i;
        }

      for(j=0; j<NLoops; j++)
        {
#pragma omp parallel for
          for(i=0; i<Size; i++)
            {
              dataP[i]=cos(dataP[i]);
            }
        }

      free(dataP);
      t2 = accurate_time();
      printf("%ld\t\t%ld\t%lf\t",Size,NLoops,t2-t1);

      data = malloc(Size*sizeof(double));
      t1 = accurate_time();

      for(i=0; i<Size; i++)
        {
          data[i]=i;
        }

      for(j=0; j<NLoops; j++)
        {
          for(i=0; i<Size; i++)
            {
              data[i]=cos(data[i]);
            }
        }

      free(data);
      t2 = accurate_time();
      printf("%lf\n",t2-t1);
    }

  return 0;
}

and the results:

Size            NLoops  OpenMP(s)       serial(s)
10000000        80      10.308471       30.007250
1000000         160     1.902563        5.800172
100000          320     0.543008        1.123274
10000           640     0.206823        0.223031
1000            1280    0.088898        0.044268
100             2560    0.150429        0.008880
10              5120    0.289589        0.002084

 ---> On this machine, we should start to use threads *in this testcase* 
only if Size >= 10000 (a 100*100 image is a very, very small one :))

Any other results are welcome :)

Cheers,
Xavier





