[Numpy-discussion] Cython-based OpenMP-accelerated quartic polynomial solver

Wed Sep 30 18:04:50 EDT 2015

On 30/09/15 11:27, Daπid wrote:

> Is there a nice way to ship both versions? After all, most
> implementations of BLAS and friends do spawn OpenMP threads, so I don't
> think it would be outrageous to take advantage of it in more places;

Some do, others don't.

ACML uses OpenMP.

GotoBLAS uses OpenMP.

Intel MKL uses Intel TBB and OpenMP (both of them).

OpenBLAS will by default use an internal threadpool. It can be 
configured to use OpenMP instead.

ATLAS uses its own threadpool.

Apple Accelerate Framework uses a kernel thread-pool called the Grand 
Central Dispatch (GCD). A library called libdispatch uses kqueue to 
access the GCD.

There are two principal problems with using OpenMP in NumPy:

One is that the GNU OpenMP threadpool is not fork-safe, and can break 
multiprocessing on some platforms (e.g. when using Python 2.7 on Linux). 
Anything that uses GCD has this nasty effect on Apple and FreeBSD as 
well. Note that the problem is actually in multiprocessing. It is not 
present on Windows (there is no fork system call) and it is avoidable 
even on Linux with Python 3.4 or later. Also the default builds of NumPy 
and SciPy on MacOSX uses GCD by default.

The other problem is that NumPy with its iterator objects and gufuncs is 
not particularly fit for multithreading. There will be a lot of 
contention for the iterator object. Without a major redesign, one thread 
would do useful work while most of the others would be busy-waiting on a 
spinlock and fighting for iterator access. Not to mention that the 
iterator object would be false shared between the processors, which 
would trash the performance if you have more than one CPU, even when 
compared to using a single thread. This means that for multithreading to 
be useful in NumPy, the loops will have to be redesigned so that the 
work sharing between the threads is taken care of ahead of creating the 
iterator, i.e. allowing us to use one iterator per thread. Each iterator 
would then iterate over a slice of the original array. This is ok, but 
we cannot simply do this by adding OpenMP pragmas in the C code.

Given the two problems mentioned above, it would likely be better to use 
a fork-safe threadpool instead of OpenMP.

Sturla