Hi all,

I’m speculating here, but something could be going wrong with the scheduler: it may be placing threads on the low-power cores too often, so that they then have to be moved to the high-power cores. Would it be an option to somehow detect the number of high-power cores and limit `OPENBLAS_NUM_THREADS` to that before OpenBLAS is ever loaded? Given that the M1 Max and M1 Pro now exist, hard-coding this number would limit performance on those chips, but that is definitely better than a 50x slowdown.
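To make the idea concrete, here is a rough sketch of what such detection could look like. It assumes that the `hw.perflevel0.physicalcpu` sysctl key reports the number of high-power cores on Apple Silicon under macOS 12; the function names are hypothetical, and nothing here has been tested against the slowdown itself.

```python
import os
import platform
import subprocess


def count_performance_cores():
    """Best-effort count of high-power cores; falls back to all cores."""
    fallback = os.cpu_count() or 1
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        return fallback
    try:
        # Assumption: perflevel0 is the performance cluster on Apple Silicon.
        out = subprocess.run(
            ["sysctl", "-n", "hw.perflevel0.physicalcpu"],
            capture_output=True, text=True, check=True, timeout=5,
        ).stdout
        return max(1, int(out.strip()))
    except (OSError, subprocess.SubprocessError, ValueError):
        return fallback


def cap_openblas_threads():
    """Set OPENBLAS_NUM_THREADS unless the user already chose a value."""
    # setdefault leaves an explicit user setting untouched.
    return os.environ.setdefault(
        "OPENBLAS_NUM_THREADS", str(count_performance_cores())
    )
```

`cap_openblas_threads()` would have to run before anything loads `libopenblas`, otherwise the variable is read too late to matter.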

Best regards,
Hameer Abbasi

On 17.11.2021 at 06:57, Ralf Gommers <ralf.gommers@gmail.com> wrote:

Hi all,

As you may know, we still do not have native wheels up on PyPI for arm64 (M1) macOS. There were/are multiple problems, the most important being a kernel panic (i.e. the OS crashes and the laptop restarts): https://github.com/scipy/scipy/issues/14688. That issue is now fixed in macOS 12, which was released on 25 Oct 2021. So we can consider releasing wheels for macOS 12 only; we won't for macOS 11.x, so users will just have to upgrade, or install from conda-forge, which has worked perfectly fine for almost a year.

The main issue that is remaining is a performance issue, see the summary at https://github.com/scipy/scipy/issues/15050. Hopefully we can find the root cause of that issue. In the meantime though, I'd like to discuss what the release strategy should be here. I'll copy what I wrote in the "Can we work around the problem?" section of the issue here:


If we release wheels for macOS 12, many people are going to hit this problem. A 50x slowdown for some code using linalg functionality for the default install configuration of `pip install numpy scipy` does not seem acceptable - that will lead too many users on wild goose chases. On the other hand it should be pointed out that if users build SciPy 1.7.2 from source on a native arm64 Python install, they will anyway hit the same problem. So not releasing any wheels isn't much better; at best it signals to users that they shouldn't use `arm64` just yet but stick with `x86_64` (but that does have some performance implications as well).

At this point it looks like controlling the number of threads that OpenBLAS uses is the way we can work around this problem (or let users do so). Ways to control threading:

- Use `threadpoolctl` (see the README at https://github.com/joblib/threadpoolctl for how)
- Set an environment variable to control the behavior, e.g. `OPENBLAS_NUM_THREADS`
- Rebuild the `libopenblas` we bundle in the wheel to have a max number of threads of 1, 2, or 4.
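As an illustration of the first option, a user-side workaround with `threadpoolctl` might look roughly like the sketch below. The helper name `blas_thread_limit` is hypothetical; the import is guarded because `threadpoolctl` is not a SciPy dependency, and without it the helper does nothing.

```python
import contextlib

try:
    from threadpoolctl import threadpool_limits
except ImportError:  # threadpoolctl is optional and may not be installed
    threadpool_limits = None


def blas_thread_limit(max_threads):
    """Context manager capping BLAS threads when threadpoolctl is available."""
    if threadpool_limits is None:
        # No threadpoolctl: return a no-op context manager.
        return contextlib.nullcontext()
    return threadpool_limits(limits=max_threads, user_api="blas")


# Usage: wrap only the code that hits the slowdown.
with blas_thread_limit(4):
    pass  # call the affected linalg routines here
```

Unlike the environment variable, this limit can be applied after `libopenblas` has been loaded, and it is lifted again when the `with` block exits.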

SciPy doesn't have a `threadpoolctl` runtime dependency, and it doesn't seem desirable to add one just for this issue. Note though that gh-14441 aims to add it as an _optional_ dependency to improve test suite parallelism, and longer term we perhaps do want that dependency. Also, scikit-learn has a hard dependency on it, so many users will already have it installed.

Rebuilding `libopenblas` with a low max number of threads does not allow users who know what they are doing or don't suffer from the problem to optimize threading behavior for their own code. It was pointed out in https://github.com/scipy/scipy/issues/14688#issuecomment-969143657 that this is undesirable.

Setting an environment variable is also not a great thing to do (a library should normally never ever do this), but if it works to do so in `scipy/__init__.py` then that may be the most pragmatic solution right now. However, this must be done before `libopenblas` is first loaded or it won't take effect. So if users import numpy first, setting the env var will have no effect on the copy of `libopenblas` that numpy has already loaded. It needs testing whether this then still works around the problem or not.
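A minimal sketch of what such a step in `scipy/__init__.py` could look like follows. The function name and the default of 4 threads are placeholders, not a proposal for actual values; the function also reports whether numpy was already imported, since in that case numpy's copy of `libopenblas` will not see the setting.

```python
import os
import sys

_ENV_VAR = "OPENBLAS_NUM_THREADS"


def set_default_openblas_threads(max_threads=4, environ=os.environ,
                                 modules=sys.modules):
    """Return (value_in_effect, numpy_already_imported).

    The variable is only set when the user has not set it, so an
    explicit user choice always wins.
    """
    # If numpy is already imported, its libopenblas was loaded earlier
    # and will not pick up a value set here.
    numpy_loaded_first = "numpy" in modules
    value = environ.setdefault(_ENV_VAR, str(max_threads))
    return value, numpy_loaded_first
```

The `environ` and `modules` parameters exist only so the behavior can be checked without touching the real process state.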



Thoughts on which option seems best? Any other options I missed?

Cheers,
Ralf
