Consider linking to Accelerate in Apple Silicon ARM64 wheels
Hello!

I recently got a new MacBook Pro with an M2 Pro CPU (ARM64). When I ran some numerical computations (ICA to be precise), I was surprised how slow it was - way slower than e.g. my almost 10-year-old Intel Mac. It turns out that the default OpenBLAS, which is what you get when installing the binary wheel with pip (i.e. "pip install numpy"), is the reason why computations are so slow.

When installing NumPy from source (by using "pip install --no-binary :all: --no-use-pep517 numpy"), it uses the Apple-provided Accelerate framework, which includes an optimized BLAS library. The difference is mind-boggling, I'd even say that NumPy is pretty much unusable with the default OpenBLAS backend (at least for the applications I tested).

In my test with four different ICA algorithms, I got these runtimes with the default OpenBLAS:

- FastICA: 6.3s
- Picard: 26.3s
- Infomax: 0.8s
- Extended Infomax: 1.4s

Especially the second algorithm is way slower than on my 10-year-old Intel Mac using OpenBLAS.

Here are the times with Accelerate:

- FastICA: 0.4s
- Picard: 0.6s
- Infomax: 1.0s
- Extended Infomax: 1.3s

Given this huge performance difference, my question is if you would consider distributing a binary wheel for ARM64-based Macs which links to Accelerate. Or are there any caveats why you do not want to do that? I know that NumPy moved away from Accelerate years ago on Intel Macs, but maybe now is the time to reconsider this decision.

Thanks!
Clemens
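For reference, the ratios implied by the timings above can be worked out with a few lines of Python. This is only arithmetic on the numbers reported in the message; the dictionary and function names are illustrative and not part of any library.

```python
# Runtimes in seconds, copied from the message above.
openblas = {"FastICA": 6.3, "Picard": 26.3, "Infomax": 0.8, "Extended Infomax": 1.4}
accelerate = {"FastICA": 0.4, "Picard": 0.6, "Infomax": 1.0, "Extended Infomax": 1.3}


def speedups(baseline, candidate):
    """Return baseline/candidate runtime ratios; values above 1 mean candidate is faster."""
    return {name: baseline[name] / candidate[name] for name in baseline}


ratios = speedups(openblas, accelerate)
for name, factor in ratios.items():
    print(f"{name}: {factor:.1f}x")
```

On these figures FastICA and Picard come out roughly 16 and 44 times faster with Accelerate, Extended Infomax is about the same, and Infomax is slightly slower, so the gain depends heavily on the algorithm.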
On Thu, Mar 23, 2023 at 1:55 AM Clemens Brunner <clemens.brunner@gmail.com> wrote:
Hi Clemens, thanks for the suggestion and benchmarks. We actually discussed this in the last community meeting. Accelerate as of today is supported when building from source, and that will use 32-bit BLAS/LAPACK (the LP64 interface). Since NumPy 1.22 we're shipping our wheels with the 64-bit (ILP64) interface, which Accelerate doesn't provide. That's about to change though, in macOS 13.3: https://developer.apple.com/documentation/macosreleasenotes/macos13_3rel.... That release will also upgrade to LAPACK 3.9.1, which means we can re-enable it for SciPy too.

For macOS 14 we will most likely, if things go well, ship wheels linked against the new ILP64 Accelerate build. Due to packaging limitations (the `packaging` library ignores minor versions of macOS), we can't ship wheels for >=13.3, only >=14.

Cheers,
Ralf
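To make the LP64/ILP64 distinction concrete: an LP64 BLAS takes 32-bit integers for dimensions and strides, while an ILP64 BLAS takes 64-bit integers. The sketch below is a simplified illustration of that one difference, assuming the only constraint is that each array dimension fits in the interface's integer type. The helper name `fits_blas_int` is hypothetical, and real libraries may impose further limits.

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold (LP64 BLAS)
INT64_MAX = 2**63 - 1  # largest value a 64-bit signed integer can hold (ILP64 BLAS)


def fits_blas_int(shape, ilp64):
    """Check whether every dimension in `shape` fits the BLAS integer type.

    Hypothetical helper for illustration only; it models nothing beyond
    the integer width of the interface.
    """
    limit = INT64_MAX if ilp64 else INT32_MAX
    return all(dim <= limit for dim in shape)


# A vector with three billion elements is too long for a 32-bit index.
big_vector = (3_000_000_000,)
print(fits_blas_int(big_vector, ilp64=False))
print(fits_blas_int(big_vector, ilp64=True))
```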
Thanks Ralf, this sounds great! Just making sure I understand, this means that for macOS 13, we have to enable Accelerate by building NumPy from source. Once macOS 13.3 is out, building SciPy from source will also link to Accelerate. Finally, Accelerate support will be enabled by default in binary wheels for NumPy and SciPy once macOS 14 is out (presumably some time in October this year). Correct?

Clemens
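The packaging limitation mentioned earlier in the thread (wheels can target >=14 but not >=13.3) can be sketched as follows. This is a simplified model, not the `packaging` implementation: it assumes that for macOS 11 and later an installer only accepts platform tags of the form `macosx_<major>_0_<arch>`. The function names are hypothetical.

```python
def supported_macos_tags(major, minor, arch="arm64"):
    """Simplified model of the platform tags an installer accepts on macOS 11+.

    The minor version is deliberately unused: only '<major>_0' tags are
    generated, from the running major version down to macOS 11.
    """
    return [f"macosx_{m}_0_{arch}" for m in range(major, 10, -1)]


def installer_accepts(wheel_tag, major, minor, arch="arm64"):
    """Return True if a wheel with `wheel_tag` would match on macOS major.minor."""
    return wheel_tag in supported_macos_tags(major, minor, arch)


# A wheel tagged for 13.3 never matches, even on macOS 13.3 itself.
print(installer_accepts("macosx_13_3_arm64", 13, 3))
# A wheel tagged for 14.0 matches on any macOS 14.x.
print(installer_accepts("macosx_14_0_arm64", 14, 1))
```

Under this model a `macosx_13_3` wheel is never selected, which is why the first release that can carry an Accelerate-linked wheel is macOS 14.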
On Thu, Mar 23, 2023 at 10:43 AM Clemens Brunner <clemens.brunner@gmail.com> wrote:

> Thanks Ralf, this sounds great! Just making sure I understand, this means that for macOS 13, we have to enable Accelerate by building NumPy from source.

Indeed. Either that, or use a packaging system that's more capable in this regard - conda-forge for example will give you runtime switching of BLAS and LAPACK libraries out of the box ( https://condaforge.org/docs/maintainer/knowledge_base.html#switchingblasi...). Several Linux distros support this too, via mechanisms like `update-alternatives` - but that won't help you much on macOS of course.

> Once macOS 13.3 is out, building SciPy from source will also link to Accelerate. Finally, Accelerate support will be enabled by default in binary wheels for NumPy and SciPy once macOS 14 is out (presumably some time in October this year). Correct?

Yes, if I had to guess now then I'd say that this will be the default in NumPy 2.0 at the end of the year.

Cheers,
Ralf
This sounds terrific. Is there a technique to get such a performance improvement with Accelerate?

We tried:

    pip install numpy==1.24.3 --no-binary numpy

with and without a ~/.numpy-site.cfg file that sets

    [accelerate]
    libraries = Accelerate, vecLib

on Intel and M2 Macs on macOS 13.3. While np.show_config() reports that numpy is linked to Accelerate, a large dot product runs maybe twice as fast as with OpenBLAS, and our full bio simulation's performance is within measurement noise.

Trying the --no-use-pep517 argument had no obvious impact.

Best,
Jerry

On Thu, Mar 23, 2023 at 3:52 AM Ralf Gommers <ralf.gommers@gmail.com> wrote:
On Tue, May 9, 2023 at 5:11 AM Jerry Morrison < jerry.morrison+numpy@gmail.com> wrote:
From the release notes: "To use the new interfaces, define ACCELERATE_NEW_LAPACK before including the Accelerate or vecLib headers". So you are still using the old version of the library if you did not do that. This is not entirely trivial to do now; I expect we'll see a PR to enable the new libraries quite soon. Cheers, Ralf
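A comparison like the one described in this exchange (checking the linked BLAS with np.show_config() and timing a large dot product) could be scripted roughly as below. This is a minimal sketch, not the benchmark anyone in the thread actually ran: the helper names, the matrix size and the repeat count are all assumptions, and `benchmark_dot` needs NumPy installed.

```python
import time


def best_time(fn, repeats=5):
    """Return the fastest wall-clock time in seconds over `repeats` calls to fn."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)


def benchmark_dot(n=4000, repeats=5):
    """Print the build configuration and time an n x n matrix product.

    Hypothetical helper; n and repeats are arbitrary choices.
    """
    import numpy as np  # imported here so best_time() stays usable without NumPy

    np.show_config()  # reports which BLAS/LAPACK this NumPy build is linked to
    rng = np.random.default_rng(0)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal((n, n))
    return best_time(lambda: a @ b, repeats)
```

Running `benchmark_dot()` once in an environment with the OpenBLAS wheel and once with a source build gives two numbers to compare. A single matrix product only exercises one BLAS routine, so it may not predict the behaviour of a full application.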
_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
To unsubscribe send an email to numpy-discussion-leave@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
Member address: ralf.gommers@gmail.com
On Tue, May 9, 2023 at 12:13 PM Ralf Gommers <ralf.gommers@gmail.com> wrote:
> From the release notes: "To use the new interfaces, define ACCELERATE_NEW_LAPACK before including the Accelerate or vecLib headers". So you are still using the old version of the library if you did not do that. This is not entirely trivial to do now; I expect we'll see a PR to enable the new libraries quite soon.

... macOS Ventura 13.3 release notes. So Clemens must've tweaked the NumPy source code or build commands to do that. OK, I'll wait for this PR. Thanks, Ralf!
participants (3)

Clemens Brunner

Jerry Morrison

Ralf Gommers