[Numpy-discussion] Accelerate or OpenBLAS for numpy / scipy wheels?

Sturla Molden sturla.molden at gmail.com
Wed Jun 29 17:06:54 EDT 2016


Ralf Gommers <ralf.gommers at gmail.com> wrote:

> For most routines performance seems to be comparable, and both are much
> better than ATLAS. When there's a significant difference, I have the
> impression that OpenBLAS is more often the slower one (example:
> https://github.com/xianyi/OpenBLAS/issues/533).

Accelerate is generally better optimized for level-1 and level-2 BLAS than
OpenBLAS. There are two reasons for this:

First, OpenBLAS does not use AVX for these kernels, but Accelerate does.
This is the more important difference. It seems the OpenBLAS devs are now
working on this.
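
To give a sense of what is at stake, here is a minimal sketch of what an
AVX level-1 kernel looks like: a double-precision dot product moving four
doubles per 256-bit register. This is my own illustration, not actual
OpenBLAS or Accelerate code:

#include <immintrin.h>
#include <stddef.h>

double ddot_avx(size_t n, const double *x, const double *y)
{
    __m256d acc = _mm256_setzero_pd();
    size_t i = 0;

    /* Four doubles per iteration in a 256-bit AVX register; a
       pre-AVX (SSE2) kernel moves only two at a time. */
    for (; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(x + i);
        __m256d vy = _mm256_loadu_pd(y + i);
        acc = _mm256_add_pd(acc, _mm256_mul_pd(vx, vy));
    }

    double tmp[4];
    _mm256_storeu_pd(tmp, acc);
    double sum = tmp[0] + tmp[1] + tmp[2] + tmp[3];

    for (; i < n; i++)   /* scalar tail */
        sum += x[i] * y[i];

    return sum;
}

Since level-1 and level-2 routines are memory-bound and spend little time
per element, doubling the vector width matters much more here than it does
for the blocked, compute-bound level-3 kernels.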

Second, the thread pool in OpenBLAS does not scale as well on small tasks
as the "Grand Central Dispatch" (GCD) used by Accelerate. The GCD thread
pool used by Accelerate is actually quite unique in having a very tiny
overhead: it takes only 16 extra opcodes (IIRC) to run a task on the global
parallel queue instead of on the current thread. (Even if my memory is not
perfect and it is not exactly 16 opcodes, it is within that order of
magnitude.) GCD can do this because the global queues and the thread pool
are actually built into the kernel of the OS. OpenBLAS and MKL, on the
other hand, depend on thread pools managed in userspace, of which the OS
scheduler has no special knowledge. When you need fine-grained parallelism
and synchronization, there is nothing like GCD. Even a userspace spinlock
will have bigger overhead than a sequential queue in GCD. With a userspace
thread pool all threads are scheduled on a round-robin basis, but with GCD
the scheduler has special knowledge about the tasks put on the queues, and
executes them as fast as possible. Accelerate therefore has a unique
advantage when running level-1 and level-2 BLAS routines, with which
OpenBLAS and MKL can probably never properly compete.

Programming with GCD can often be counter-intuitive to someone used to
dealing with OpenMP, MPI or pthreads. For example, it is often better to
enqueue a lot of small tasks instead of splitting up the computation into
large chunks of work. When parallelising a tight loop, a chunk size of 1
can be great with GCD but is likely to be horrible with OpenMP or anything
else that uses a userspace thread pool.
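
As a concrete illustration (my own sketch, not Accelerate internals), here
is a chunk-size-1 loop using GCD's C API: dispatch_apply() submits each
iteration as its own task on the global concurrent queue. Compile on OS X
with: clang -O2 gcd_axpy.c -o gcd_axpy

#include <dispatch/dispatch.h>
#include <stdio.h>

#define N 1000000

static double x[N], y[N];

int main(void)
{
    const double a = 2.0;
    for (size_t i = 0; i < N; i++) {
        x[i] = (double)i;
        y[i] = 1.0;
    }

    /* One task per iteration (chunk size 1): dispatch_apply()
       enqueues N tiny tasks on the global concurrent queue and
       blocks until all of them have run. This is practical only
       because GCD's per-task overhead is so low. */
    dispatch_queue_t q =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_apply(N, q, ^(size_t i) {
        y[i] += a * x[i];   /* daxpy-style update */
    });

    printf("y[42] = %f\n", y[42]);
    return 0;
}

The equivalent one-iteration-per-task decomposition in OpenMP, something
like schedule(dynamic,1), would typically be swamped by the scheduling
overhead of the userspace runtime.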

Sturla
