[Numpy-discussion] NumPy re-factoring project

Sturla Molden sturla at molden.no
Thu Jun 10 17:56:56 EDT 2010


On 10.06.2010 22:28, Pauli Virtanen wrote:
>
> Some places where OpenMP could probably help are in the inner ufunc
> loops. However, improving the memory efficiency of the data access
> pattern is another low-hanging fruit for multidimensional arrays.
>

Eliminating the temporary intermediate array in expressions like a + b 
would be very cool, but it is not as trivial as it seems.

Being able to control aliasing (cf. Fortran's no-aliasing rules and 
C99's restrict qualifier) can help. For example, for binary array 
operators we know that the output array (a temporary intermediate) is 
not aliased with the two arguments.
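
A minimal sketch of what that buys us (the function name and signature 
are illustrative, not NumPy's actual inner loop): with C99 restrict the 
compiler may assume out does not overlap a or b, so it is free to 
vectorize and reorder loads and stores.

#include <stddef.h>

/* hypothetical binary-operator inner loop; `restrict` promises the
   compiler that out, a and b do not alias each other */
static void add_double(double *restrict out,
                       const double *restrict a,
                       const double *restrict b,
                       size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}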

Ufuncs involving transcendental functions would benefit from OpenMP. 
Expensive reductions like var and std would also benefit.
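
For example (a hypothetical sketch, not NumPy's actual ufunc loop), an 
elementwise exp is trivially parallel, and the per-element cost is high 
enough that threading pays off for large arrays:

#include <math.h>
#include <stddef.h>

/* hypothetical transcendental ufunc inner loop, parallelized with
   OpenMP; every iteration is independent */
static void exp_double(double *restrict out, const double *restrict in,
                       size_t n)
{
    #pragma omp parallel for
    for (ptrdiff_t i = 0; i < (ptrdiff_t)n; i++)
        out[i] = exp(in[i]);
}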

We can also use copy-in copy-out for strided array access (see below).

> I'm not sure how OpenMP plays together with thread-safety, especially if
> used in outer constructs. Whether this is possible to do easily is not so
> clear, as NumPy uses e.g. iterator and loop objects, which probably
> cannot be shared between different OpenMP threads. So one would have to
> do some extra work to parallelize these parts manually.
>

OpenMP has pragmas for serializing access to code:

#pragma omp critical
{
      // executed by all threads, but only one at a time
      // (the lock can also be named)
}

#pragma omp atomic
a += 1; // atomic update (e.g. nice for refcounts, instead of a big lock)

#pragma omp single
{
      // done by exactly one (arbitrary) thread
}

#pragma omp master
{
      // done by the master thread only
      // e.g. used when calling back to the Python C API
}

#pragma omp barrier
// wait for all threads here

That's about all you need to know about thread synchronization in OpenMP...
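
To tie them together, here is a toy example (not NumPy code; assumes a 
compiler with OpenMP enabled, e.g. -fopenmp) that uses atomic, barrier 
and single in one parallel region:

#include <stdio.h>

int main(void)
{
    int ndone = 0;

    #pragma omp parallel
    {
        /* ... per-thread work would go here ... */

        #pragma omp atomic
        ndone += 1;          // cheap, lock-free counter update

        #pragma omp barrier  // wait until every thread has counted in

        #pragma omp single
        printf("%d threads finished\n", ndone);
    }
    return 0;
}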


> But probably if multithreading is desired, OpenMP sounds like the way
> to go...
>

Anyone proficient in C or Fortran can learn OpenMP in 10 minutes, and 
it is vastly easier than manual multithreading. It is now supported by 
the majority of compilers.

Also, about array iterators in NumPy's C base (i.e. for doing something 
along an axis): we don't need those. There is a different way of coding 
which leads to faster code:

1. Collect an array of pointers to each subarray (e.g. using 
std::vector<dtype*> or dtype**)
2. Dispatch on the pointer array...

I used this to reimplement lfilter in scipy. It is just a lot faster 
than array iterators. It also allows us to use OpenMP in a very natural 
way, as the sketch below shows. For multidimensional arrays I have a 
recursive function for collecting the array of subarray pointers, but 
for 2D and 3D we can use nested for-loops.
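
A sketch of the idea for the 2D case (illustrative names and signature, 
not the actual scipy lfilter code; each row is assumed contiguous, with 
rowstride elements between rows):

#include <stddef.h>
#include <stdlib.h>

/* apply func to each row of an nrows x ncols array */
static void apply_along_rows(double *data, size_t nrows, size_t ncols,
                             ptrdiff_t rowstride, /* in elements */
                             void (*func)(double *row, size_t n))
{
    double **rows = malloc(nrows * sizeof(double *));
    if (rows == NULL)
        return;

    /* 1. collect the subarray pointers (a nested loop handles 3D;
          a recursive function generalizes to N dimensions) */
    for (size_t i = 0; i < nrows; i++)
        rows[i] = data + (ptrdiff_t)i * rowstride;

    /* 2. dispatch on the pointer array; the rows are independent,
          so OpenMP parallelization comes naturally */
    #pragma omp parallel for
    for (ptrdiff_t i = 0; i < (ptrdiff_t)nrows; i++)
        func(rows[i], ncols);

    free(rows);
}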

Another thing I did when reimplementing lfilter was "copy-in copy-out" 
for strided arrays. This also helps a lot. It might seem paradoxical, as 
we need to iterate over the strided array(s) to make the copies, but in 
practice it is faster by a factor of 2 or 3. Fortran compilers also tend 
to do this optimization for strided arrays. It is even mentioned in the 
Fortran standard as explicitly allowed, as it has consequences for array 
pointers (they can be invalidated by function calls if a contiguous copy 
is made).
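
In code, the pattern looks roughly like this (a sketch with 
hypothetical names, not the actual lfilter implementation):

#include <stddef.h>
#include <stdlib.h>

/* run kernel on n elements of x that lie stride elements apart */
static void apply_strided(double *x, ptrdiff_t stride, size_t n,
                          void (*kernel)(double *buf, size_t n))
{
    double *buf = malloc(n * sizeof(double));
    if (buf == NULL)
        return;

    for (size_t i = 0; i < n; i++)   /* copy-in: gather */
        buf[i] = x[(ptrdiff_t)i * stride];

    kernel(buf, n);                  /* fast contiguous inner loop */

    for (size_t i = 0; i < n; i++)   /* copy-out: scatter */
        x[(ptrdiff_t)i * stride] = buf[i];

    free(buf);
}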

Sturla