Accelerate or OpenBLAS for numpy / scipy wheels?
Hi,

I just succeeded in getting an automated dual-arch build of numpy and scipy using OpenBLAS. See the last three build jobs in these two build matrices:

https://travis-ci.org/matthew-brett/numpy-wheels/builds/140388119
https://travis-ci.org/matthew-brett/scipy-wheels/builds/140684673

Tests are passing on 32- and 64-bit. I didn't upload these to the usual Rackspace container at wheels.scipy.org, to avoid confusion.

So, I guess the question now is: should we switch to shipping OpenBLAS wheels for the next release of numpy and scipy, or should we stick with the Accelerate framework that comes with OS X?

In favor of the Accelerate build: faster to build; it's what we've been doing thus far.

In favor of the OpenBLAS build: allows us to commit to one BLAS / LAPACK library cross-platform once we have the Windows builds working; faster to fix bugs, with good support from the main developer; no multiprocessing crashes for Python 2.7.

Any thoughts?

Cheers,
Matthew
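(For anyone who wants to verify which library a given wheel was built against, a quick sketch - `np.__config__.show()` is the standard check, though the exact section names it prints vary by numpy version:)

```python
# Sketch: check which BLAS/LAPACK a numpy build was compiled against.
import numpy as np

# Prints the build-time configuration; look for an "openblas_info"
# section versus an Accelerate/vecLib one, depending on the wheel.
np.__config__.show()
```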
On Mon, Jun 27, 2016 at 9:46 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
I'm still a bit nervous about OpenBLAS, see https://github.com/scipy/scipy/issues/6286. That was with version 0.2.18, which is pretty recent. Chuck
Hi, On Tue, Jun 28, 2016 at 5:25 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
Well - we are committed to OpenBLAS already for the Linux wheels, so if that failure was due to an error in OpenBLAS, we'll have to report it and get it fixed / fix it ourselves upstream. Cheers, Matthew
On Tue, Jun 28, 2016 at 2:55 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
Faster to build isn't really an argument, right? The build time should be the same, except for building OpenBLAS itself once per OpenBLAS version. And it only applies to building wheels for releases - nothing changes for source builds done by users on OS X. If build time ever becomes a real issue, then dropping the dual-arch stuff is probably the way to go - the 32-bit builds make very little sense these days.

"What we've been doing thus far" - that is the more important argument. There's a risk in switching: we may encounter new bugs, or lose some performance in particular functions.
In favor of the OpenBLAS build: allows us to commit to one BLAS / LAPACK library cross-platform,
This doesn't really matter too much imho, we have to support Accelerate either way.
This is probably the main reason to make the switch, if we decide to do that.
Indeed. And those wheels have been downloaded a lot already, without any issues being reported. I'm +0 on the proposal - the risk seems acceptable, but the reasons to make the switch are also not super compelling. Ralf
Hi, On Tue, Jun 28, 2016 at 7:33 AM, Ralf Gommers <ralf.gommers@gmail.com> wrote:
Yes, that's true, but as you know, the OS X system and Python.org Pythons are still dual-arch, so technically a matching wheel should also be dual-arch. I agree, though, that there's near-zero likelihood that the 32-bit arch will ever get exercised.
I guess I'm about +0.5 (multiprocessing, simplifying mainstream BLAS / LAPACK support) - I'm floating it now because I hadn't got the build machinery working before. Cheers, Matthew
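(On the dual-arch point: whether a given extension actually ships both architectures is easy to check with the OS X `lipo` tool; a sketch, with a hypothetical module path for illustration:)

```python
# Sketch: list the CPU architectures inside a built extension module
# using the OS X `lipo` tool. The path below is hypothetical.
import subprocess

out = subprocess.check_output(
    ["lipo", "-info", "numpy/core/multiarray.so"]  # hypothetical path
)
print(out)  # e.g. "Architectures in the fat file: ... are: i386 x86_64"
```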
On Tue, Jun 28, 2016 at 8:15 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
but as they say, practicality beats purity... It's not clear yet whether 3.6 will be built dual-arch, but in any case, no one is going to go back and change the builds for 2.7 or 3.4 or 3.5. But that doesn't mean we necessarily need to support dual arch downstream. Personally, I'd drop it and see if anyone screams. Though it's actually a bit tricky, at least with my knowledge, to build a 64-bit-only extension against the dual-arch build. The only way I figured out was to hack the install. (I did this a while back when I needed a 32-bit-only build -- ironic?)
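(One way that may avoid hacking the install, sketched here without testing across Python versions: distutils on OS X honors the ARCHFLAGS environment variable, so a single-arch extension build can be requested explicitly:)

```python
# Sketch: build an x86_64-only extension against a dual-arch Python by
# overriding ARCHFLAGS, which OS X distutils passes to the compiler.
import os
import subprocess

env = dict(os.environ, ARCHFLAGS="-arch x86_64")
subprocess.check_call(["python", "setup.py", "build_ext", "--inplace"],
                      env=env)
```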
This doesn't really matter too much imho, we have to support Accelerate either way.
do we? -- so if we go OpenBLAS, and someone wants to do a simple build from source, what happens? Do they get Accelerate? Or would we ship the OpenBLAS source itself? Or would they need to install OpenBLAS some other way?
Faster to fix bugs, with good support from the main developer; no multiprocessing crashes for Python 2.7.
this seems to be the compelling one. How does the performance compare?

-CHB

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R
7600 Sand Point Way NE
Seattle, WA 98115

(206) 526-6959 voice
(206) 526-6329 fax
(206) 526-6317 main reception

Chris.Barker@noaa.gov
On Tue, Jun 28, 2016 at 5:50 PM, Chris Barker <chris.barker@noaa.gov> wrote:
Indeed, unless they go through the effort of downloading a separate BLAS and LAPACK, and figuring out how to make that visible to numpy.distutils. Very few users will do that.
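(For reference, "making it visible to numpy.distutils" amounts to a site.cfg file in the numpy source root; a minimal sketch, assuming for illustration that OpenBLAS was installed under /opt/OpenBLAS:)

```ini
; Minimal site.cfg sketch; the /opt/OpenBLAS paths are an assumption
; for illustration, not a required location.
[openblas]
libraries = openblas
library_dirs = /opt/OpenBLAS/lib
include_dirs = /opt/OpenBLAS/include
runtime_library_dirs = /opt/OpenBLAS/lib
```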
Or would we ship the OpenBLAS source itself?
Definitely don't want to do that.
Or would they need to install OpenBLAS some other way?
Yes, or MKL, or ATLAS, or BLIS. We have support for all these, and that's a good thing. Making a uniform choice for our official binaries on various OSes doesn't reduce the need or effort for supporting those other options.
For most routines performance seems to be comparable, and both are much better than ATLAS. When there's a significant difference, I have the impression that OpenBLAS is more often the slower one (example: https://github.com/xianyi/OpenBLAS/issues/533). Ralf
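(Anyone wanting numbers for their own machine can start from a rough sketch like the following - a sanity check rather than a rigorous benchmark, with sizes and repeat counts chosen arbitrarily:)

```python
# Rough timing sketch: compare matrix-matrix and matrix-vector products
# under whichever BLAS the local numpy is linked against.
import timeit
import numpy as np

for n in (100, 1000):
    a = np.random.rand(n, n)
    v = np.random.rand(n)
    t_mm = timeit.timeit(lambda: a.dot(a), number=20)
    t_mv = timeit.timeit(lambda: a.dot(v), number=200)
    print("n=%4d  matmul: %6.4fs  matvec: %6.4fs" % (n, t_mm, t_mv))
```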
On Jun 29, 2016 2:49 AM, "Andrew Jaffe" <a.h.jaffe@gmail.com> wrote:
Accelerate
Speed is important, but it's far from the only consideration, especially since differences between the top-tier libraries are usually rather small. (And note that even though that bug is still listed as open, it has a link to a commit that appears to have fixed it by implementing the missing kernels.)

The advantage of OpenBLAS is that it's open source and fixable, and we already focus energy on supporting it for Linux (and probably Windows too, soon). Accelerate is closed, so when we hit bugs there's often nothing we can do except file a bug with Apple and hope that it gets fixed within a year or two. This isn't hypothetical -- we've hit cases where Accelerate gave wrong answers. Numpy actually carries some scary code right now to work around one of these bugs by monkeypatching (!) Accelerate using dynamic linker trickiness.

And, of course, there's the thing where Accelerate totally breaks multiprocessing. Apple has said that they don't consider this a bug. Which is probably not much comfort to the new users who are getting obscure hangs when they try to use Python's most obvious and commonly recommended concurrency library. If you sum across our user base, I'm 99% sure that this means Accelerate is slower than OpenBLAS on net, because you need a *lot* of code getting 10% speedups before it cancels out one person spending 3 days trying to figure out why their code is silently hanging for no reason.

This probably makes me sound more negative about Accelerate than I actually am -- it does work well most of the time, and obviously lots of people are using it successfully with numpy. But for our official binaries, my vote is that we should switch to OpenBLAS, because these binaries are likely to be used by non-experts who are likely to hit the multiprocessing issue, and because when we're already struggling to do sufficient QA on our releases, it makes sense to focus our efforts on a single BLAS library.

-n
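(For concreteness, a minimal sketch of that failure mode - on OS X with an Accelerate-linked numpy and fork-based multiprocessing, the pool call below can hang silently; with other BLAS libraries it simply prints the results:)

```python
# Sketch of the Accelerate + multiprocessing hang: forked workers
# inherit Accelerate's threading state but not its worker threads,
# so BLAS calls in the children can deadlock.
import multiprocessing
import numpy as np

def child_dot(_):
    a = np.ones((500, 500))
    return np.dot(a, a).sum()  # may never return under Accelerate

if __name__ == "__main__":
    a = np.ones((500, 500))
    np.dot(a, a)                          # run BLAS once in the parent
    pool = multiprocessing.Pool(2)
    print(pool.map(child_dot, range(4)))  # hangs here under Accelerate
```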
Ralf Gommers <ralf.gommers@gmail.com> wrote:
Accelerate is in general better optimized for level-1 and level-2 BLAS than OpenBLAS. There are two reasons for this:

First, OpenBLAS does not use AVX for these kernels, but Accelerate does. This is the more important difference. It seems the OpenBLAS devs are now working on this.

Second, the thread pool in OpenBLAS is not as scalable on small tasks as the "Grand Central Dispatch" (GCD) used by Accelerate. The GCD thread pool used by Accelerate is actually quite unique in having a very tiny overhead: it takes only 16 extra opcodes (IIRC) to run a task on the global parallel queue instead of on the current thread. (Even if my memory is not perfect and it is not exactly 16 opcodes, it is within that order of magnitude.) GCD can do this because the global queues and thread pool are actually built into the kernel of the OS. On the other hand, OpenBLAS and MKL depend on thread pools managed in userspace, about which the scheduler in the OS has no special knowledge. When you need fine-grained parallelism and synchronization, there is nothing like GCD. Even a user-space spinlock will have bigger overhead than a sequential queue in GCD. With a userspace thread pool, all threads are scheduled on a round-robin basis, but with GCD the scheduler has special knowledge about the tasks put on the queues, and executes them as fast as possible. Accelerate therefore has a unique advantage when running level-1 and level-2 BLAS routines, with which OpenBLAS or MKL can probably never properly compete.

Programming with GCD can actually often be counter-intuitive to someone used to dealing with OpenMP, MPI or pthreads. For example, it is often better to enqueue a lot of small tasks than to split the computation into large chunks of work. When parallelising a tight loop, a chunk size of 1 can be great on GCD but is likely to be horrible on OpenMP and anything else that has userspace threads.

Sturla
On Wed, Jun 29, 2016 at 11:06 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
Thanks Sturla, interesting details as always. You didn't state your preference by the way, do you have one? We're building binaries for the average user, so I'd say the AVX thing is of relevance for the decision to be made, the GCD one less so (people who care about that will not have any trouble building their own numpy). So far the score is: one +1, one +0.5, one +0, one -1 and one "still a bit nervous". Any other takers? Ralf
Ralf Gommers <ralf.gommers@gmail.com> wrote:
Thanks Sturla, interesting details as always. You didn't state your preference by the way, do you have one?
I use Accelerate because it is easier for me to use when building SciPy. But that is from a developer's perspective.

As you know, Accelerate breaks a common (ab)use of multiprocessing on POSIX systems. While the bug is strictly speaking in multiprocessing (and partially fixed in Python 3.4 and later), it is still a nasty surprise to many users. E.g. a call to np.dot never returns, and there is no error message indicating why. That speaks against using it in the wheels.

Accelerate, like MKL and FFTW, has nifty FFTs. If we start to use MKL and Accelerate for numpy.fft (which I sometimes have fantasies about), that would shift the balance the other way, in favour of Accelerate.

Speed-wise, Accelerate wins for things like the dot product of two vectors or multiplication of a vector and a matrix. For general matrix multiplication the performance is about the same, except when matrices are very small and Accelerate can benefit from the tiny GCD overhead. But then the Python overhead probably dominates, so they are going to be about equal anyway.

I am going to vote ±0. I am really not sure which will be better for the binary wheels. They seem about equal to me right now. There are pros and cons with either.

Sturla
participants (7)

- Andrew Jaffe
- Charles R Harris
- Chris Barker
- Matthew Brett
- Nathaniel Smith
- Ralf Gommers
- Sturla Molden