For the long term, I think you should be aware that there seems to be a possibly related problem with spinning thread waits in OpenMP and OpenBLAS threads in code that sequentially calls a Cython prange loop (which uses OpenMP) and a scipy Cython BLAS function (which uses OpenBLAS): https://github.com/xianyi/OpenBLAS/issues/3187. The active spinning of threads waiting for their next task confuses the OS scheduler and degrades performance by preventing full use of the available cores.

This situation can be observed in scikit-learn when installed from the PyPI wheels (on any Linux platform) because the OpenMP runtime used by scikit-learn is `libgomp` (linked into the scikit-learn wheel), while the threading layer used by OpenBLAS is its internal threading layer (OpenBLAS being built as part of the scipy wheel). When installing everything from conda-forge (and maybe from the anaconda defaults channel as well, I haven't checked), the problem goes away because both the scikit-learn prange loops and the OpenBLAS thread operations rely on the same OpenMP runtime (llvm-openmp by default on conda-forge, if I am not mistaken). However, in the OpenBLAS/OpenMP case the performance degradation is far milder than the 50x slowdown observed in this issue.

So it might be worth implementing a short-term stopgap workaround for the Apple M1 case, while keeping in mind that it would be worth investing more time to fix the root-cause problem of duplicated threading runtimes in a single Python program. Indeed, both problems would go away if we had a clean way for wheels to share the same threading runtimes, both for OpenBLAS and for OpenMP. But doing this would require a significant community-wide coordination effort, possibly by implementing something like this oldish proposal by @njsmith: https://mail.python.org/pipermail/wheel-builders/2016-April/000090.html
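For reference, here is a minimal sketch of the kind of stopgap mitigation I have in mind, using threadpoolctl to inspect which threading runtimes are loaded in the process and to cap the OpenBLAS pool around a mixed prange + BLAS section. The choice of limiting the BLAS side, and the limit of 1 thread, are illustrative assumptions rather than a confirmed fix for the root cause:

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# List the threading runtimes loaded in this process: with PyPI wheels one
# typically sees both libgomp (from the scikit-learn wheel) and the internal
# OpenBLAS threading layer (from the scipy/numpy wheels).
for module in threadpool_info():
    print(module["user_api"], module["internal_api"], module["num_threads"])

a = np.random.rand(2000, 2000)

# Temporarily cap the OpenBLAS thread pool so that the two runtimes do not
# compete for the same cores while their idle threads spin between tasks.
with threadpool_limits(limits=1, user_api="blas"):
    a @ a  # this BLAS call runs single-threaded inside the block
```

Alternatively, setting the standard `OMP_WAIT_POLICY=PASSIVE` environment variable before starting the Python process should make `libgomp` threads sleep instead of spin while waiting, which can reduce the scheduler confusion described above at the cost of slower thread wake-ups.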