[Numpy-discussion] Openmp support (was numpy's future (1.1 and beyond): which direction(s) ?)
Gnata Xavier
xavier.gnata at gmail.com
Sun Mar 23 07:03:51 EDT 2008
Travis E. Oliphant wrote:
> Anne Archibald wrote:
>
>> On 22/03/2008, Travis E. Oliphant <oliphant at enthought.com> wrote:
>>
>>
>>> James Philbin wrote:
>>> > Personally, I think that the time would be better spent optimizing
>>> > routines for single-threaded code and relying on BLAS and LAPACK
>>> > libraries to use multiple cores for more complex calculations. In
>>> > particular, doing some basic loop unrolling and SSE versions of the
>>> > ufuncs would be beneficial. I have some experience writing SSE code
>>> > using intrinsics and would be happy to give it a shot if people tell
>>> > me what functions I should focus on.
>>>
>>> Fabulous! This is on my Project List of todo items for NumPy. See
>>> http://projects.scipy.org/scipy/numpy/wiki/ProjectIdeas I should spend
>>> some time refactoring the ufunc loops so that the templating does not
>>> get in the way of doing this on a case by case basis.
>>>
>>> 1) You should focus on the math operations: add, subtract, multiply,
>>> divide, and so forth.
>>> 2) Then for "combined operations" we should expose the functionality at
>>> a high-level. So, that somebody could write code to take advantage of it.
>>>
>>> It would be easiest to use intrinsics which would then work for AMD,
>>> Intel, on multiple compilers.
>>>
>>>
>> I think even heavier use of code generation would be a good idea here.
>> There are so many different versions of each loop, and the fastest way
>> to run each one is going to be different for different versions and
>> different platforms, that a routine that assembled the code from
>> chunks and picked the fastest combination for each instance might make
>> a big difference - this is roughly what FFTW and ATLAS do.
>>
>> There are also some optimizations to be made at a higher level that
>> might give these optimizations more traction. For example:
>>
>> A = randn(100*100)
>> A.shape = (100,100)
>> A*A
>>
>> There's no reason the multiply ufunc couldn't flatten A and use a
>> single unstrided loop to do the multiplication.
>>
>>
> Good idea, it does already do that :-) The ufunc machinery is also a
> good place for an optional thread pool.
>
> Perhaps we could drum up interest in a Need for Speed Sprint on NumPy
> sometime over the next few months.
>
>
> -Travis O.
>
Hi,
I have a very limited knowledge of openmp but please consider this
testcase :
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>
#define N 100000000
int
main(void)
{
double *data;
data = malloc(N*sizeof(double));
long i;
#pragma omp parallel for
for(i=0;i<N;i++)
{
data[i]=i;
}
long j;
for(j=0;j<4;j++)
{
#pragma omp parallel for
for(i=0;i<N;i++)
{
data[i]=cos(data[i]);
}
}
return 0;
}
gcc -fopenmp -Wall -lm -O3 sin.c -o sinopenmp and gcc -fopenmp -Wall
-lm -O3 sin.c -o sin
On my core2duo :
time ./sin
real 0m15.910s
user 0m15.249s
sys 0m0.646s
and
time ./sinopenmp
real 0m8.699s
user 0m16.287s
sys 0m0.893s
It scales very well :) (gcc-4.2). It would be so nice to see that usign
numpy.sin(a)
Ok it is a very simple case but numpy.sin(a) is such a case isn't it ??
Please give it a try ;)
Xavier
More information about the NumPy-Discussion
mailing list