releasing wheels for arm64 (M1) macOS 12
Hi all,

As you may know, we still do not have native wheels on PyPI for arm64 (M1) macOS. There were/are multiple problems, the most important one being a kernel panic (i.e. the OS crashes and the laptop restarts): https://github.com/scipy/scipy/issues/14688. That issue is now fixed in macOS 12, which was released on 25 Oct 2021. So we can consider releasing wheels for macOS 12 only; we won't for macOS 11.x, so users will just have to upgrade - or install from conda-forge, which has worked perfectly fine for almost a year.

The main remaining issue is a performance issue; see the summary at https://github.com/scipy/scipy/issues/15050. Hopefully we can find the root cause of that issue. In the meantime, though, I'd like to discuss what the release strategy should be here. I'll copy what I wrote in the "Can we work around the problem?" section of the issue:

If we release wheels for macOS 12, many people are going to hit this problem. A 50x slowdown for some code using linalg functionality under the default install configuration of `pip install numpy scipy` does not seem acceptable - that will lead too many users on wild goose chases. On the other hand, it should be pointed out that if users build SciPy 1.7.2 from source on a native arm64 Python install, they will hit the same problem anyway. So not releasing any wheels isn't much better; at best it signals to users that they shouldn't use `arm64` just yet but stick with `x86_64` (which has some performance implications as well).

At this point it looks like controlling the number of threads that OpenBLAS uses is the way we can work around this problem (or let users do so). Ways to control threading:

- Use `threadpoolctl` (see the README at https://github.com/joblib/threadpoolctl for how)
- Set an environment variable to control the behavior, e.g. `OPENBLAS_NUM_THREADS`
- Rebuild the `libopenblas` we bundle in the wheel to have a max number of threads of 1, 2, or 4.

SciPy doesn't have a `threadpoolctl` runtime dependency, and it doesn't seem desirable to add one just for this issue. Note though that gh-14441 aims to add it as an _optional_ dependency to improve test suite parallelism, and longer term we perhaps do want that dependency. Also, scikit-learn has a hard dependency on it, so many users will already have it installed.

Rebuilding `libopenblas` with a low max number of threads does not allow users who know what they are doing, or who don't suffer from the problem, to optimize threading behavior for their own code. It was pointed out in https://github.com/scipy/scipy/issues/14688#issuecomment-969143657 that this is undesirable.

Setting an environment variable is also not a great thing to do (a library should normally never do this), but if it works to do so in `scipy/__init__.py` then that may be the most pragmatic solution right now. However, this must be done before `libopenblas` is first loaded or it won't take effect. So if users import numpy first, setting an env var will have no effect on that copy of `libopenblas`. It needs testing whether this still works around the problem or not. A minimal sketch of this option follows below.

Thoughts on which option seems best? Any other options I missed?

Cheers, Ralf
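A minimal sketch of the environment-variable option, assuming it runs at the very top of `scipy/__init__.py` before the bundled `libopenblas` is first loaded (the limit of 1 is illustrative, not a settled choice):

```
import os

# Must run before libopenblas is first loaded; has no effect on a copy of
# libopenblas that numpy already loaded. setdefault so an explicit user
# setting always wins.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
```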
On Tue, Nov 16, 2021, at 21:57, Ralf Gommers wrote:
Setting an environment variable is also not a great thing to do (a library should normally never ever do this), but if it works to do so in `scipy/__init__.py` then that may be the most pragmatic solution right now. However, this must be done before `libopenblas` is first loaded or it won't take effect. So if users import numpy first, then setting an env var will already have no effect on that copy of `libopenblas`. It needs testing whether this then still works around the problem or not.
Instead of using the environment variable, could we call `openblas_set_num_threads`? Stéfan
On 17/11/21 7:57 am, Ralf Gommers wrote:
Hi all,
...
At this point it looks like controlling the number of threads that OpenBLAS uses is the way we can work around this problem (or let users do so). Ways to control threading:
- Use `threadpoolctl` (see the README at https://github.com/joblib/threadpoolctl for how)
- Set an environment variable to control the behavior, e.g. `OPENBLAS_NUM_THREADS`
- Rebuild the `libopenblas` we bundle in the wheel to have a max number of threads of 1, 2, or 4.
...
Thoughts on which option seems best? Any other options I missed?
Cheers, Ralf
There are OpenBLAS-specific utility functions like `openblas_set_num_threads` [0]. They lack documentation about which routines they affect, but it might be an avenue to explore. Perhaps `openblas_get_num_threads`/`openblas_set_num_threads` could be used around the offending call like a context manager? Disadvantages:

- This would affect global state.
- It is not clear how to pull these functions into scipy. We tried fishing them out in CI via ctypes to check the OpenBLAS version, and failed on Windows. Perhaps with an `#ifdef OPENBLAS` somewhere in C code?

Matti

[0] https://github.com/xianyi/OpenBLAS/wiki/OpenBLAS-Extensions
On Wed, Nov 17, 2021 at 7:49 AM Matti Picus <matti.picus@gmail.com> wrote:
Good point (and thanks Stefan for making the same point at the same time), I think we can. We could do this only in this one arm64 wheel (put the code in `_distributor_init.py`), and use the code from threadpoolctl, something like:

```
import ctypes

# filepath points at the libopenblas shared library bundled in the wheel;
# _RTLD_NOLOAD is the platform's RTLD_NOLOAD flag (as used in threadpoolctl),
# so we only grab a handle to the already-loaded library.
_dynlib = ctypes.CDLL(filepath, mode=_RTLD_NOLOAD)
set_func = getattr(
    _dynlib,
    "openblas_set_num_threads",
    # Symbols differ when built for 64-bit integers in Fortran
    getattr(_dynlib, "openblas_set_num_threads64_",
            lambda num_threads: None),
)
set_func(num_threads)
```

We can't get our hands on the NumPy-vendored OpenBLAS that way though (there's no guarantee it even has OpenBLAS), so it's not as comprehensive a fix as either using threadpoolctl or the user setting an env var.

Cheers, Ralf
We can't get our hands on the NumPy-vendored OpenBLAS that way though (there's no guarantee it even has OpenBLAS), so it's not as comprehensive a fix as either using threadpoolctl or the user setting an env var.
But maybe this is not necessary for the specific Apple M1 problem? I can give it a try.
Hi, On Wed, Nov 17, 2021 at 5:57 AM Ralf Gommers <ralf.gommers@gmail.com> wrote: [...]
- Use `threadpoolctl` (see the README at https://github.com/joblib/threadpoolctl for how)
- Set an environment variable to control the behavior, e.g. `OPENBLAS_NUM_THREADS`
- Rebuild the `libopenblas` we bundle in the wheel to have a max number of threads of 1, 2, or 4.
SciPy doesn't have a `threadpoolctl` runtime dependency, and it doesn't seem desirable to add one just for this issue. Note though that gh-14441 aims to add it as an _optional_ dependency to improve test suite parallelism, and longer term we perhaps do want that dependency. Also, scikit-learn has a hard dependency on it, so many users will already have it installed.
Is threadpoolctl a heavy dependency? Am I right in thinking this is the most satisfying of the solutions, if available? If it's not heavy, and it is a good solution - that seems like the right way to go. Cheers, Matthew
On Wed, Nov 17, 2021 at 10:13 AM Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
On Wed, Nov 17, 2021 at 5:57 AM Ralf Gommers <ralf.gommers@gmail.com> wrote: [...]

- Use `threadpoolctl` (see the README at https://github.com/joblib/threadpoolctl for how)
- Set an environment variable to control the behavior, e.g. `OPENBLAS_NUM_THREADS`
- Rebuild the `libopenblas` we bundle in the wheel to have a max number of threads of 1, 2, or 4.

SciPy doesn't have a `threadpoolctl` runtime dependency, and it doesn't seem desirable to add one just for this issue. Note though that gh-14441 aims to add it as an _optional_ dependency to improve test suite parallelism, and longer term we perhaps do want that dependency. Also, scikit-learn has a hard dependency on it, so many users will already have it installed.

Is threadpoolctl a heavy dependency? Am I right in thinking this is the most satisfying of the solutions, if available?

Yes, I believe so. Of course it's hard to be 100% sure, because the root cause isn't well understood.

If it's not heavy, and it is a good solution - that seems like the right way to go.
It's not heavy; it's a small-ish pure Python library with no dependencies. It can cause problems of course, because it's trying to do something nontrivial - but that's true of code we write as well. For the minimal purpose we need it for (setting the default number of threads for SciPy, and probably also for NumPy if it uses OpenBLAS), it should be fine.

I think the main concern is simply having any new external dependency - so far NumPy has zero (on PyPI at least) and SciPy has one (namely NumPy). Any new runtime dependency is inevitably going to result in problems at some point. That is why we avoid them, so it's not a minor decision. I'd also be hesitant to do that in 1.7.3 rather than in 1.8.0.

Cheers, Ralf
Hi, On Wed, Nov 17, 2021 at 10:00 AM Ralf Gommers <ralf.gommers@gmail.com> wrote:
I think the main concern is simply having any new external dependency - so far NumPy has zero (on PyPI at least) and SciPy has one (namely NumPy). Any new runtime dependency inevitably is going to result in problems at some point. Which is why we avoid them, so it's not a minor decision. I'd also be hesitant to do that in 1.7.3 rather than in 1.8.0
I understand the general reluctance - but, given the state of packaging tools, a single pure-Python package doesn't sound like a significant worry... Cheers, Matthew
Hi, On 17/11/2021 11:26, Matthew Brett wrote:
threadpoolctl is designed to solve precisely this kind of issue:

```
import threadpoolctl

controller = threadpoolctl.ThreadpoolController().select(internal_api="openblas")
with controller.limit(limits=1):
    ...  # single-threaded BLAS call here
```

I understand that adding a hard dependency on a package that currently has only numpy as a dependency is not desirable. Would vendoring it be an alternative?

Cheers, Jérémie du Boisberranger
On Wed, Nov 17, 2021 at 11:36 AM Jeremie du Boisberranger < jeremie.du-boisberranger@inria.fr> wrote:
I understand that adding a hard dependency on a package that has currently only numpy as dependency is not desirable. Would vendoring it be an alternative ?
Thanks for the feedback Jérémie. If we vendor it, and we get two copies of threadpoolctl floating around that are both being used at the same time, will that work fine? Cheers, Ralf
Not 100% sure but I think it might work (would need to check). However vendoring can also be troublesome. If we discover a bug in threadpoolctl that impacts many users, we would need to have a quick coordinated re-release of threadpoolctl, numpy and scipy... This is likely to cause maintenance headaches.
On 17/11/2021 11:43, Ralf Gommers wrote:
Thanks for the feedback Jérémie. If we vendor it, and we get two copies of threadpoolctl floating around that are both being used at the same time, will that work fine?
It's true that it might have some unexpected behavior, in a nested setting for instance. Let's say scikit-learn sets a threadpoolctl context and calls some scipy function inside it that also uses threadpoolctl; it's not clear to me yet whether everything will work smoothly. Moreover, suppose we set the limit to 1 in scikit-learn to avoid oversubscription but you set the limit to 4 in scipy: then we would end up with oversubscription in scikit-learn (this case has nothing to do with vendoring, however). A sketch of the nesting concern follows below.

Jérémie du Boisberranger
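For illustration, a minimal sketch of that nesting concern using threadpoolctl's `ThreadpoolController` (the specific limits are assumptions chosen for the example):

```
from threadpoolctl import ThreadpoolController

controller = ThreadpoolController()
# Outer, scikit-learn-style limit to avoid oversubscription
with controller.limit(limits=1, user_api="blas"):
    # Inner, scipy-style limit with a different value - the inner limit
    # wins while active, which is the oversubscription concern above
    with controller.limit(limits=4, user_api="blas"):
        pass  # BLAS calls here may use up to 4 threads
    # on exiting the inner block, the outer limit of 1 is restored
```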
Any new runtime dependency inevitably is going to result in problems at some point. Which is why we avoid them, so it's not a minor decision. I'd also be hesitant to do that in 1.7.3 rather than in 1.8.0
I share the feeling. But maybe this dependency and stopgap could be added only to the Apple M1 wheels (via a condition in setup.py plus a `_distributor_init.py` that uses threadpoolctl only for this platform)? This way we could limit the possible negative impacts to the scipy users who are already badly supported by 1.7.2. conda-forge has no such problem, so I don't think this limitation of the number of OpenBLAS threads should be applied to that distribution.
Hi all, I'm speculating here, but something could be going wrong with the scheduler, which schedules things on the low-power cores too often so that they have to be moved to the high-power cores. Would it be an option to somehow detect the number of high-power cores and limit `OPENBLAS_NUM_THREADS` to that before OpenBLAS is ever loaded? Given that the M1 Max and M1 Pro now exist, hard-coding this number would limit performance on those chips, but that is definitely better than a 50x slowdown. A sketch of such detection is below. Best regards, Hameer Abbasi
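A rough sketch of that detection idea - the `hw.perflevel0.physicalcpu` sysctl key for the performance-core count is an assumption (it appears to exist on macOS 12+), as is the fallback behavior:

```
import os
import subprocess

def _performance_core_count():
    # Assumed sysctl key reporting the number of performance ("high-power")
    # cores on Apple Silicon under macOS 12+; returns None if unavailable.
    try:
        out = subprocess.check_output(
            ["sysctl", "-n", "hw.perflevel0.physicalcpu"], text=True
        )
        return int(out.strip())
    except (OSError, subprocess.SubprocessError, ValueError):
        return None

ncores = _performance_core_count()
if ncores is not None:
    # Must happen before OpenBLAS is ever loaded to take effect
    os.environ.setdefault("OPENBLAS_NUM_THREADS", str(ncores))
```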
For the long term, I think you should be aware that there seems to be a possibly related problem with spin-waiting OpenMP and OpenBLAS threads in code that sequentially calls a Cython prange loop (which uses OpenMP) and a scipy Cython BLAS function (which uses OpenBLAS): https://github.com/xianyi/OpenBLAS/issues/3187. The active spinning of threads waiting for the next task confuses the OS scheduler and degrades performance by preventing use of the available cores.

This situation can be observed in scikit-learn when installed from the PyPI wheels (on any Linux platform), because the OpenMP runtime used by scikit-learn is `libgomp` (linked into the scikit-learn wheel) while OpenBLAS uses its internal threading layer, built as part of the scipy wheel. When installing everything from conda-forge (and maybe from the anaconda defaults channel as well, I haven't checked), the problem goes away, because both the scikit-learn prange loops and the OpenBLAS thread operations rely on the same OpenMP runtime (llvm-openmp by default on conda-forge, if I am not mistaken). However, in the OpenBLAS/OpenMP case the performance degradation is far below the 50x slowdown observed in this issue.

So it might be worth doing a short-term stopgap workaround for the Apple M1 case, while keeping in mind that it would be worth investing more time to fix the root-cause problem of duplicated threading runtimes in a single Python program. Indeed, both problems would go away if we had a clean way for wheels to share the same threading runtimes, both for OpenBLAS and OpenMP, but doing this would require a significant community-wide coordination effort, possibly by implementing something like this oldish proposal by @njsmith: https://mail.python.org/pipermail/wheel-builders/2016-April/000090.html
After some investigation in https://github.com/scipy/scipy/issues/15050, we can confirm that the problem is indeed very similar. I think that for scipy 1.7.3 we can do a stopgap workaround that sets the `OPENBLAS_THREAD_TIMEOUT` env variable to `"1"` in `_distributor_init.py`, but only for the macos/arm64 wheel; a sketch is below. And then we can discuss whether we want to generalize this to numpy and scipy wheels for later major releases, and try to tackle a cleaner approach that would avoid duplicated linking of threaded libraries in the scipy stack one way or another.
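A minimal sketch of that stopgap, assuming it lives in scipy's `_distributor_init.py` and runs before the bundled `libopenblas` is first loaded (the platform check is illustrative, not the shipped code):

```
import os
import platform

# Hypothetical stopgap for the macOS arm64 wheel only: shorten how long
# idle OpenBLAS threads spin-wait before going to sleep, which is what
# appears to confuse the scheduler on the M1.
if platform.system() == "Darwin" and platform.machine() == "arm64":
    # setdefault so an explicit user setting still takes precedence
    os.environ.setdefault("OPENBLAS_THREAD_TIMEOUT", "1")
```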
On Wed, Nov 17, 2021 at 5:17 PM Olivier Grisel <olivier.grisel@ensta.org> wrote:
After some investigations in https://github.com/scipy/scipy/issues/15050, we can confirm that the problem is indeed very similar.
I think that for scipy 1.7.3 we can do a stopgap workaround that sets the `OPENBLAS_THREAD_TIMEOUT` env variable to `"1"` in `_distributor_init.py` but only for the macos/arm64 wheel.
That does sound like the right plan to me, thanks for finding this workaround Olivier!

And then we can discuss if we want to generalize this to numpy and scipy wheels for later major releases and try to tackle a cleaner approach that would avoid duplicated linking of threaded libraries in the scipy stack one way or another.
That would require quite a bit of design and thinking indeed - it needs doing, but agreed that that's for later. It's inherently problematic with the PyPI model, and there are multiple options for solving it. We should also have a look at our stances on OpenMP - SciPy forbids using it, while scikit-learn is starting to use it more. Cheers, Ralf
Rebuild the libopenblas we bundle in the wheel to have a max number of threads of 1, 2, or 4.
This solution would not have been possible in OpenBLAS 0.3.17 and earlier because of a coupling between the max threadpool size and the pre-allocated memory buffers used by OpenBLAS (see https://github.com/xianyi/OpenBLAS/issues/3321). It seems to have been fixed in OpenBLAS 0.3.18 (https://github.com/xianyi/OpenBLAS/pull/3352), but I have not taken the time to check yet.
participants (7)
- Hameer Abbasi
- Jeremie du Boisberranger
- Matthew Brett
- Matti Picus
- Olivier Grisel
- Ralf Gommers
- Stefan van der Walt