[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Gnata Xavier
xavier.gnata at gmail.com
Sun Mar 23 16:45:37 EDT 2008
David Cournapeau wrote:
> Gnata Xavier wrote:
>
>> Well of course my goal was not to say that my simple testcase can be
>> copied/pasted into numpy :)
>> Of course it is one of the best cases for using OpenMP.
>> Of course pragmas can be more complex than that (you can tell which
>> variables can/cannot be shared, for instance).
>>
>> The size: using OpenMP will be slower on small arrays, that is clear,
>> but a user doing very large computations is smart enough to know when
>> he needs to split his job into threads. The obvious solution is to
>> provide the user with parallel (//) and non-parallel functions.
>>
>
> IMHO, that's a really bad solution. It should be dynamically enabled
> (like in Matlab, if I remember correctly). And this means having a plug-in
> subsystem to load/unload different implementations... that is one of the
> things I was interested in getting done for numpy 1.1 (or above).
>
>
I fully agree. It was only a naive way to keep things simple.
> For small arrays: how much slower? Does it make the code slower than
> without OpenMP? For example, what does your code say when N is 10,
> 100, 1000?
>
>
See the code and the results at the end of this post.
>> SSE: SSE can help a lot, but multithreading scales where single-threaded
>> SSE-based solutions don't.
>>
>
> It depends: it scales pretty well if you use several processes, and if
> you can design your application in a multi-process way.
>
>
ok.
>> Build/link : It is an issue. It has to be tested. I do not know because
>> I haven't even tried.
>>
>> So, IMHO it would be nice to try putting some simple OpenMP pragmas into
>> numpy *only to see what is going on*.
>>
>> Even if it only works with gcc, or even if... I do not know... it would be
>> a first step. Step by step :)
>>
>
> I agree about the step by step approach; I am just not sure I agree with
> your steps, that's all. Personally, I would first try getting a plug-in
> system working with numpy.
I do agree with that :). Sorry, I should have put that clearly
before: I do agree.
> But really, prove me wrong. Do it, try
> putting some pragmas at some places in the ufunc machinery or somewhere
> else; as I said earlier, I would be happy to add support for OpenMP at
> the build level, at least in numscons. I would love being proven wrong
> and having a numpy which scales well with multi-core :)
>
>
Ok, I will try to see what I can do, but we certainly do need the
plug-in system first (read "before the threads in the numpy release").
During the development of 1.1, I will try to find some time to understand
where I should put some pragmas into the ufunc machinery, using a very
conservative approach. Anyone with some OpenMP knowledge is welcome,
because I'm not an OpenMP expert, only an OpenMP user in my C/C++ codes.
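To make the discussion concrete, here is the kind of pragma I have in mind for a ufunc-style reduce loop. This is purely an illustrative sketch (sum_reduce is a made-up name, not numpy's actual inner-loop API); the reduction clause is one example of telling OpenMP how a variable is shared between threads:

```c
#include <stddef.h>

/* Illustrative sketch only -- sum_reduce is a hypothetical name, not a
 * numpy function. The reduction(+:total) clause tells OpenMP to give
 * each thread a private copy of total and combine them with + at the
 * end, instead of all threads racing on one shared accumulator.
 * Compiled without -fopenmp, the pragma is simply ignored and the loop
 * runs serially with the same result. */
double
sum_reduce(const double *data, long n)
{
    double total = 0.0;
#pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; i++)
        total += data[i];
    return total;
}
```

Compiled with gcc -fopenmp this spreads the partial sums over the available cores; the same source still builds and gives the same answer without the flag.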
Here is the code (I hope it is correct):
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#include <sys/time.h>
#include <unistd.h>

double
accurate_time()
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double)t.tv_sec + t.tv_usec * 0.000001;
}

int
main(void)
{
    long i, j, k;
    double t1, t2;
    double *dataP;
    double *data;
    long Size = 1e8;
    long NLoops = 40;

    for (k = 1; k < 8; k++)
    {
        Size /= 10;
        NLoops *= 2;

        /* OpenMP version: init then NLoops passes of cos() */
        dataP = malloc(Size * sizeof(double));
        t1 = accurate_time();
#pragma omp parallel for
        for (i = 0; i < Size; i++)
        {
            dataP[i] = i;
        }
        for (j = 0; j < NLoops; j++)
        {
#pragma omp parallel for
            for (i = 0; i < Size; i++)
            {
                dataP[i] = cos(dataP[i]);
            }
        }
        free(dataP);
        t2 = accurate_time();
        printf("%ld\t\t%ld\t%lf\t", Size, NLoops, t2 - t1);

        /* serial version: same work, no pragmas (free() is inside the
         * timed region in both cases, so the comparison stays fair) */
        data = malloc(Size * sizeof(double));
        t1 = accurate_time();
        for (i = 0; i < Size; i++)
        {
            data[i] = i;
        }
        for (j = 0; j < NLoops; j++)
        {
            for (i = 0; i < Size; i++)
            {
                data[i] = cos(data[i]);
            }
        }
        free(data);
        t2 = accurate_time();
        printf("%lf\n", t2 - t1);
    }
    return 0;
}
and the results (times in seconds):

Size        NLoops   OpenMP      serial
10000000        80   10.308471   30.007250
 1000000       160    1.902563    5.800172
  100000       320    0.543008    1.123274
   10000       640    0.206823    0.223031
    1000      1280    0.088898    0.044268
     100      2560    0.150429    0.008880
      10      5120    0.289589    0.002084
---> On this machine, threads start to pay off *in this testcase*
iff size >= 10000 (and a 100*100 image is a very, very small one :))
Any other results are welcome :)
Cheers,
Xavier