<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jun 29, 2016 at 11:06 PM, Sturla Molden <span dir="ltr"><<a href="mailto:sturla.molden@gmail.com" target="_blank">sturla.molden@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">Ralf Gommers <<a href="mailto:ralf.gommers@gmail.com">ralf.gommers@gmail.com</a>> wrote:<br>

<br>

> For most routines performance seems to be comparable, and both are much<br>

> better than ATLAS. When there's a significant difference, I have the<br>

> impression that OpenBLAS is more often the slower one (example:<br>

</span>> <a href="https://github.com/xianyi/OpenBLAS/issues/533" rel="noreferrer" target="_blank">https://github.com/xianyi/OpenBLAS/issues/533</a>).<br>

<br>

Accelerate is in general better optimized for level-1 and level-2 BLAS than<br>

OpenBLAS. There are two reasons for this:<br>

<br>

First, OpenBLAS does not use AVX for these kernels, but Accelerate does.<br>

This is the more important difference. It seems the OpenBLAS devs are now<br>

working on this.<br>

<br>

Second, the thread pool in OpenBLAS is not as scalable on small tasks as<br>

the "Grand Central Dispatch" (GCD) used by Accelerate. The GCD thread-pool<br>

used by Accelerate is actually unusual in having a very tiny overhead:<br>

it takes only 16 extra opcodes (IIRC) to run a task on the global<br>

parallel queue instead of the current thread. (Even if my memory is not<br>

perfect and it is not exactly 16 opcodes, it is within that order of<br>

magnitude.) GCD can do this because the global queues and thread pool are<br>

actually built into the kernel of the OS. On the other hand, OpenBLAS and<br>

MKL depend on thread pools managed in userspace, about which the scheduler<br>

in the OS has no special knowledge. When you need fine-grained parallelism<br>

and synchronization, there is nothing like GCD. Even a user-space spinlock<br>

will have bigger overhead than a sequential queue in GCD. With a userspace<br>

threadpool all threads are scheduled on a round robin basis, but with GCD<br>

the scheduler has special knowledge about the tasks put on the queues, and<br>

executes them as fast as possible. Accelerate therefore has a unique<br>

advantage when running level-1 and 2 BLAS routines, with which OpenBLAS or<br>

MKL can probably never properly compete. Programming with GCD can actually<br>

often be counter-intuitive to someone used to dealing with OpenMP, MPI or<br>

pthreads. For example it is often better to enqueue a lot of small tasks<br>

instead of splitting up the computation into large chunks of work. When<br>

parallelising a tight loop, a chunk size of 1 can be great on GCD but is<br>

likely to be horrible on OpenMP and anything else that has userspace<br>

threads.<br></blockquote><div><br></div><div>Thanks, Sturla, interesting details as always. You didn't state your preference, by the way. Do you have one?<br><br></div><div>We're building binaries for the average user, so I'd say the AVX point is relevant to the decision to be made, the GCD one less so (people who care about that will not have any trouble building their own NumPy).<br></div></div><br></div><div class="gmail_extra">So far the score is: one +1, one +0.5, one +0, one -1 and one "still a bit nervous". Any other takers?<br><br></div><div class="gmail_extra">Ralf<br><br></div></div>