On Sep 30, 2015 2:28 AM, "Daπid" <davidmenhur@gmail.com> wrote:
[...]
> Is there a nice way to ship both versions? After all, most implementations of BLAS and friends do spawn OpenMP threads, so I don't think it would be outrageous to take advantage of it in more places, provided there is a nice way to fall back to a serial version when it is not available.

This is incorrect -- the only common BLAS implementation that uses *OpenMP* threads is OpenBLAS, and even then only if you run it in a special non-default configuration.

The general challenges to providing transparent multithreading in numpy are:

- gcc + OpenMP on Linux still breaks multiprocessing. There's a patch to fix this, but the gcc developers still haven't applied it; alternatively, there's a workaround you can use in multiprocessing (not using fork mode -- see the first sketch after this list), but it requires every user to update their code, and it has other limitations. We're unlikely to use OpenMP while this is the case.

- parallel code in general is not very composable. If someone is calling a numpy operation from a single thread, great: transparently using multiple threads internally is a win. But if they're exploiting some higher-level structure in their problem to break it into pieces and process each piece in parallel, using numpy on each piece, then numpy spawning threads internally will probably destroy performance, because the combined thread count oversubscribes the available cores. And numpy is too low-level to know which case it's in. This problem already exists to some extent with multi-threaded BLAS, so people use various BLAS-specific knobs to manage it in ad hoc ways (see the second sketch below), but this doesn't scale.

(Ironically, OpenMP is more composable than most approaches to threading, but only if everyone is using it -- and, as per above, not everyone is, and we currently can't.)
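
For concreteness, here is a minimal sketch of that fork-avoiding workaround, using the "spawn" start method added in Python 3.4 (the worker function, pool size, and toy workload are just placeholders):

    import multiprocessing as mp
    import numpy as np

    def work(i):
        # Each worker calls into numpy. Under "spawn" the child is a
        # fresh interpreter rather than a fork()ed copy, so it can't
        # inherit the parent's OpenMP thread pool in a broken state.
        return np.linalg.norm(np.random.rand(100, 100))

    if __name__ == "__main__":
        ctx = mp.get_context("spawn")   # the non-fork mode
        with ctx.Pool(4) as pool:
            print(pool.map(work, range(8)))

The catch is as described above: spawn starts workers more slowly than fork, everything sent to a worker must be picklable, and every user has to change their code to opt in.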

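And to illustrate the ad hoc BLAS knobs: each library reads its own environment variable, which has to be set before the library is first loaded. The variable names below are the usual ones for OpenBLAS, MKL, and OpenMP-based builds; which one matters depends on which BLAS numpy was linked against:

    import os

    # Must run before numpy (and hence the BLAS) is first imported.
    os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS
    os.environ["MKL_NUM_THREADS"] = "1"       # Intel MKL
    os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP-based builds

    import numpy as np

    # BLAS calls now run single-threaded, so the caller's own
    # parallelism (threads or processes) won't oversubscribe the cores.
    a = np.random.rand(1000, 1000)
    b = np.dot(a, a.T)

This is exactly the part that doesn't scale: the knob is per-library and global, and it's the end user, not numpy, who has to manage it.
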
-n