Cython-based OpenMP-accelerated quartic polynomial solver
Hi all,

I recently developed a Cython-based, OpenMP-accelerated quartic (and cubic, quadratic) polynomial solver to address a personal research need for quickly solving a very large number of independent low-degree polynomial equations for both real and complex coefficients.

For example, on an Intel i7-4710MQ, running with 8 threads this code solves approximately 1.2e7 quartic polynomial equations per second. (With OMP_NUM_THREADS=4 the solver gives approximately 8.9e6 solutions per second, so it seems HyperThreading in this case helps about 30%.)

The algorithms for cubics and quadratics come from Numerical Recipes (3rd ed.), and the quartic problem is internally reduced to a cubic and two quadratics, using well-known standard tricks.

Since to my understanding, for solving polynomial equations NumPy only provides roots(), which works one problem at a time, I'm writing to inquire whether there would be interest in including this functionality in NumPy, if I contribute the code (and clean it up for integration)?

I have some previous open source contribution experience. I have authored the IVTC and Phosphor deinterlacers for VLC, and a modular postprocessing filter framework for the Panda3D game engine (posted at the forums on panda3d.org, projected for version 1.10).

Currently the polynomial solver is available in a git repository hosted by our university, the relevant files being:

https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods...
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods...
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods...
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods...
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods...

I'm working on a research grant, which allows me to hold the copyright; the license is BSD.

Polysolve2 is the fast Cython+OpenMP version, while polysolve is an earlier (slower) experimental version using only NumPy array operations. The example module contains a usage example, and setup (in the standard manner) instructs distutils and Cython as to how to build polysolve2 (including setting the relevant math flags and enabling OpenMP).

(The setup script also includes some stuff specific to my solver for the Ziegler problem; basically, the "tworods" module can be ignored. I apologize for the inconvenience.)

Thoughts?

Best regards,

-J

------------------------------------------------- Juha Jeronen, Ph.D. juha.jeronen@jyu.fi University of Jyväskylä Department of Mathematical Information Technology -------------------------------------------------
This sounds pretty cool -- and I've had a use case. So it would be nice to get into Numpy.

But: I doubt we'd want OpenMP dependence in Numpy itself. But maybe a pure Cython non-MP version?

Are we putting Cython in Numpy now? I lost track.

-CHB

Sent from my iPhone
On Sep 29, 2015, at 7:35 AM, Juha Jeronen <juha.jeronen@jyu.fi> wrote:
Hi all,
I recently developed a Cython-based, OpenMP-accelerated quartic (and cubic, quadratic) polynomial solver to address a personal research need for quickly solving a very large number of independent low-degree polynomial equations for both real and complex coefficients.
For example, on an Intel i7-4710MQ, running with 8 threads this code solves approximately 1.2e7 quartic polynomial equations per second. (With OMP_NUM_THREADS=4 the solver gives approximately 8.9e6 solutions per second, so it seems HyperThreading in this case helps about 30%.)
The algorithms for cubics and quadratics come from Numerical Recipes (3rd ed.), and the quartic problem is internally reduced to a cubic and two quadratics, using well-known standard tricks.
Since to my understanding, for solving polynomial equations NumPy only provides roots(), which works one problem at a time, I'm writing to inquire whether there would be interest in including this functionality in NumPy, if I contribute the code (and clean it up for integration)?
I have some previous open source contribution experience. I have authored the IVTC and Phosphor deinterlacers for VLC, and a modular postprocessing filter framework for the Panda3D game engine (posted at the forums on panda3d.org, projected for version 1.10).
Currently the polynomial solver is available in a git repository hosted by our university, the relevant files being:
https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods... https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods... https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods... https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods... https://yousource.it.jyu.fi/jjrandom2/miniprojects/blobs/master/misc/tworods...
I'm working on a research grant, which allows me to hold the copyright; license is BSD.
Polysolve2 is the fast Cython+OpenMP version, while polysolve is an earlier (slower) experimental version using only NumPy array operations. The example module contains a usage example, and setup (in the standard manner) instructs distutils and Cython as to how to build polysolve2 (including setting the relevant math flags and enabling OpenMP).
(The setup script includes also some stuff specific to my solver for the Ziegler problem; basically, the "tworods" module can be ignored. I apologize for the inconvenience.)
Thoughts?
Best regards,
-J
------------------------------------------------- Juha Jeronen, Ph.D. juha.jeronen@jyu.fi University of Jyväskylä Department of Mathematical Information Technology -------------------------------------------------
On Tue, Sep 29, 2015 at 6:48 PM, Chris Barker - NOAA Federal < chris.barker@noaa.gov> wrote:
This sounds pretty cool -- and I've had a use case. So it would be nice to get into Numpy.
But: I doubt we'd want OpenMP dependence in Numpy itself.
But maybe a pure Cython non-MP version?
Are we putting Cython in Numpy now? I lost track.
Yes, but we need to be careful of Numerical Recipes. <snip> Chuck
Hi, On 30.09.2015 04:37, Charles R Harris wrote:
On Tue, Sep 29, 2015 at 6:48 PM, Chris Barker - NOAA Federal <chris.barker@noaa.gov <mailto:chris.barker@noaa.gov>> wrote:
This sounds pretty cool -- and I've had a use case. So it would be nice to get into Numpy.
But: I doubt we'd want OpenMP dependence in Numpy itself.
But maybe a pure Cython non-MP version?
Are we putting Cython in Numpy now? I lost track.
Yes, but we need to be careful of Numerical Recipes.
True. As I said, only the algorithms come from NR, and the code was written from scratch. I haven't even looked at any of their code, precisely due to the restrictive license.

In this particular case, the referred pages (pp. 227-229, 3rd ed.) contain only a description of the solution process in English, with equations - there is no example code printed in the book in section 5.6, Quadratic and Cubic Equations, which in its entirety is contained on pp. 227-229.

Furthermore, on p. 228, equation (5.6.12), which contains the solutions of the cubic equation, is attributed to Francois Viete. The reference given is the treatise "De emendatione", published in 1615. The same solution, also with attribution to Francois Viete, is given in Wikipedia, but without especially defining temporary variables to eliminate some common subexpressions:

https://en.wikipedia.org/wiki/Cubic_function#Trigonometric_.28and_hyperbolic...

The solution to the quadratic equation is the standard one, but with special care taken to avoid catastrophic cancellation when computing the other root. Wikipedia mentions that this issue exists in the standard formula:

https://en.wikipedia.org/wiki/Quadratic_equation#Avoiding_loss_of_significan...

but does not give an explicit remedy. The references given are:

Kahan, William (November 20, 2004), On the Cost of Floating-Point Computation Without Extra-Precise Arithmetic. (http://www.cs.berkeley.edu/~wkahan/Qdrtcs.pdf, see esp. p. 8, which provides the same remedy as NR.)

Higham, Nicholas (2002), Accuracy and Stability of Numerical Algorithms (2nd ed.), SIAM, p. 10, ISBN 978-0-89871-521-7.

(Browsing through Kahan's text, I see now that I could improve the discriminant computation to handle marginal cases better. Might do this in a future version.)

For both of these cases, I've used NR as a reference because the authors have taken special care to choose only numerically stable approaches - which is absolutely critical, yet something that sadly not all mathematics texts care about.

The quartic equation is not treated in NR. The reduction trick is reported in Wikipedia:

https://en.wikipedia.org/wiki/Quartic_function#Solving_by_factoring_into_qua...

where the original reference given is Brookfield, G. (2007). "Factoring quartic polynomials: A lost art". Mathematics Magazine 80 (1): 67–70. (This seemed an elegant approach, given stable solvers for cubics and quadratics.)

-J

------------------------------------------------- Juha Jeronen, Ph.D. juha.jeronen@jyu.fi University of Jyväskylä Department of Mathematical Information Technology -------------------------------------------------
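For reference, the cancellation-avoiding remedy referred to above, written out in its usual textbook form (this is the standard NR/Kahan formulation in my own notation, not code or text taken from the solver):

---8<---8<---8<---
% stable quadratic formula for $a x^2 + b x + c = 0$, real coefficients, $b \neq 0$
\[
  q = -\tfrac{1}{2}\left( b + \operatorname{sgn}(b)\,\sqrt{b^{2} - 4ac} \right),
  \qquad
  x_{1} = \frac{q}{a}, \qquad x_{2} = \frac{c}{q}.
\]
% The term added to $b$ has the same sign as $b$, so no subtraction of nearly
% equal quantities occurs; that subtraction is what causes the catastrophic
% cancellation mentioned above.
---8<---8<---8<---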
Yes, obviously, the code has NR parts, so it can't be licensed as BSD as it is... Matthieu 2015-09-30 2:37 GMT+01:00 Charles R Harris <charlesr.harris@gmail.com>:
On Tue, Sep 29, 2015 at 6:48 PM, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
This sounds pretty cool -- and I've had a use case. So it would be nice to get into Numpy.
But: I doubt we'd want OpenMP dependence in Numpy itself.
But maybe a pure Cython non-MP version?
Are we putting Cython in Numpy now? I lost track.
Yes, but we need to be careful of Numerical Recipes.
<snip>
Chuck
-- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher
Hi,

What qualifies as an NR part? (See my previous message concerning the references; NR is not the only source that has these algorithms. Again, the code itself is original to this solver; I haven't even looked at the code provided with NR.)

-J

------------------------------------------------- Juha Jeronen, Ph.D. juha.jeronen@jyu.fi University of Jyväskylä Department of Mathematical Information Technology -------------------------------------------------

On 30.09.2015 12:35, Matthieu Brucher wrote:
Yes, obviously, the code has NR parts, so it can't be licensed as BSD as it is...
Matthieu
2015-09-30 2:37 GMT+01:00 Charles R Harris <charlesr.harris@gmail.com>:
On Tue, Sep 29, 2015 at 6:48 PM, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
This sounds pretty cool -- and I've had a use case. So it would be nice to get into Numpy.
But: I doubt we'd want OpenMP dependence in Numpy itself.
But maybe a pure Cython non-MP version?
Are we putting Cython in Numpy now? I lost track.
Yes, but we need to be careful of Numerical Recipes.
<snip>
Chuck
On Wed, Sep 30, 2015 at 10:35 AM, Matthieu Brucher < matthieu.brucher@gmail.com> wrote:
Yes, obviously, the code has NR parts, so it can't be licensed as BSD as it is...
It's not obvious to me, especially after Juha's further clarifications. -- Robert Kern
After, I agree with you. 2015-09-30 18:14 GMT+01:00 Robert Kern <robert.kern@gmail.com>:
On Wed, Sep 30, 2015 at 10:35 AM, Matthieu Brucher <matthieu.brucher@gmail.com> wrote:
Yes, obviously, the code has NR parts, so it can't be licensed as BSD as it is...
It's not obvious to me, especially after Juha's further clarifications.
-- Robert Kern
-- Information System Engineer, Ph.D. Blog: http://matt.eifelle.com LinkedIn: http://www.linkedin.com/in/matthieubrucher
Hi, On 30.09.2015 03:48, Chris Barker - NOAA Federal wrote:
This sounds pretty cool -- and I've had a use case. So it would be nice to get into Numpy.
But: I doubt we'd want OpenMP dependence in Numpy itself.
It is indeed a good point not to add new dependencies for a small feature such as this. From a software engineering point of view, I fully agree. Note however that with the current version of the code, not having OpenMP will severely limit the performance, especially on quad-core machines, since an important factor in the speed comes from the parallel processing of the independent problem instances. But I suppose there may be another technical option to support multiple cores (doesn't NumPy already do this in some of its routines?). OpenMP was just the most convenient solution from the implementation viewpoint.
But maybe a pure Cython non-MP version?
That is of course possible. IIRC, changing the processing to occur on a single core should only require replacing Cython.parallel.prange() with range() in the array processing loops. (Note range(), not xrange(), even for Python 2.x - Cython compiles a loop using range() into an efficient C loop if some simple compile-time conditions are fulfilled.)
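For concreteness, here is a toy sketch of what such a single-core Cython loop looks like. This is a self-contained quadratic example in the same shape as the solver's loops, not the actual polysolve2 code; the function name and argument layout are made up.

---8<---8<---8<---
# toy .pyx illustrating the prange() -> range() swap described above;
# NOT the actual polysolve2 code -- names and array layout are made up
from libc.math cimport sqrt, copysign
# from cython.parallel import prange   # only needed for the OpenMP build

def solve_quadratics(double[:, ::1] coeffs, double[:, ::1] roots):
    """coeffs[j] = (a, b, c) of a*x**2 + b*x + c = 0; real roots assumed."""
    cdef Py_ssize_t j, n = coeffs.shape[0]
    cdef double a, b, c, q
    with nogil:
        for j in range(n):    # the OpenMP build uses prange(n) here instead
            a = coeffs[j, 0]
            b = coeffs[j, 1]
            c = coeffs[j, 2]
            # cancellation-avoiding form (see the references upthread)
            q = -0.5 * (b + copysign(sqrt(b*b - 4.0*a*c), b))
            roots[j, 0] = q / a
            roots[j, 1] = c / q
---8<---8<---8<---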
Are we putting Cython in Numpy now? I lost track.
I thought so? But then again, I'm just a user :) -J ------------------------------------------------- Juha Jeronen, Ph.D. juha.jeronen@jyu.fi University of Jyväskylä Department of Mathematical Information Technology -------------------------------------------------
On 30 September 2015 at 10:20, Juha Jeronen <juha.jeronen@jyu.fi> wrote:
Are we putting Cython in Numpy now? I lost track.
I thought so? But then again, I'm just a user :)
As of this moment, there are three Cython modules in Numpy, so yes. Releases ship the generated C modules, so it is not a dependency.

./numpy/distutils/tests/pyrex_ext/primes.pyx
./numpy/random/mtrand/mtrand.pyx
./tools/allocation_tracking/alloc_hook.pyx

In any case, we should always include a single-threaded version, because sometimes the computations are already parallelised at a higher level. Is there a nice way to ship both versions?

After all, most implementations of BLAS and friends do spawn OpenMP threads, so I don't think it would be outrageous to take advantage of it in more places; provided there is a nice way to fall back to a serial version when it is not available.

/David.
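One conceivable way to "ship both versions" at the Python level would be an import-time fallback, sketched below with made-up module names; this is not an existing NumPy or polysolve2 mechanism, just one possible pattern.

---8<---8<---8<---
# hypothetical packaging sketch: build _polysolve_omp only when OpenMP is
# available, always build _polysolve_serial, and pick one at import time
try:
    from . import _polysolve_omp as _impl      # made-up module name
except ImportError:
    from . import _polysolve_serial as _impl   # made-up module name

solve_quartic = _impl.solve_quartic            # re-export a common interface
---8<---8<---8<---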
On Sep 30, 2015 2:28 AM, "Daπid" <davidmenhur@gmail.com> wrote:
[...]

Is there a nice way to ship both versions? After all, most implementations of BLAS and friends do spawn OpenMP threads, so I don't think it would be outrageous to take advantage of it in more places; provided there is a nice way to fall back to a serial version when it is not available.

This is incorrect -- the only common implementation of BLAS that uses *OpenMP* threads is OpenBLAS, and even then it's not the default -- it only happens if you run it in a special non-default configuration.

The challenges to providing transparent multithreading in numpy generally are:

- gcc + OpenMP on linux still breaks multiprocessing. There's a patch to fix this but they still haven't applied it; alternatively there's a workaround you can use in multiprocessing (not using fork mode), but this requires every user to update their code and the workaround has other limitations. We're unlikely to use OpenMP while this is the case.

- parallel code in general is not very composable. If someone is calling a numpy operation from one thread, great, transparently using multiple threads internally is a win. If they're exploiting some higher-level structure in their problem to break it into pieces and process each in parallel, and then using numpy on each piece, then numpy spawning threads internally will probably destroy performance. And numpy is too low-level to know which case it's in. This problem exists to some extent already with multi-threaded BLAS, so people use various BLAS-specific knobs to manage it in ad hoc ways, but this doesn't scale.

(Ironically OpenMP is more composable than most approaches to threading, but only if everyone is using it and, as per above, not everyone is and we currently can't.)

-n
On 30.09.2015 19:20, Nathaniel Smith wrote:
The challenges to providing transparent multithreading in numpy generally are:
- gcc + OpenMP on linux still breaks multiprocessing. There's a patch to fix this but they still haven't applied it; alternatively there's a workaround you can use in multiprocessing (not using fork mode), but this requires every user update their code and the workaround has other limitations. We're unlikely to use OpenMP while this is the case.
Ah, I didn't know this. Thanks.
- parallel code in general is not very composable. If someone is calling a numpy operation from one thread, great, transparently using multiple threads internally is a win. If they're exploiting some higher-level structure in their problem to break it into pieces and process each in parallel, and then using numpy on each piece, then numpy spawning threads internally will probably destroy performance. And numpy is too low-level to know which case it's in. This problem exists to some extent already with multi-threaded BLAS, so people use various BLAS-specific knobs to manage it in ad hoc ways, but this doesn't scale.
Very good point. I've had both kinds of use cases myself. It would be nice if there was some way to tell NumPy to either use additional threads or not, but that adds complexity. It's also not a good solution, considering that any higher-level code building on NumPy, if it is designed to be at all reusable, may find *itself* in either role. Only the code that, at any particular point of time in the development of a software project, happens to form the top level at that time, has the required context...

Then again, the matter is further complicated by considering codes that run on a single machine versus codes that run on a cluster. Threads being local to each node in a cluster, it may make sense in a solver targeted for a cluster to split the problem at the process level, distribute the processes across the network, and use the threading capability to accelerate computation on each node.

A complex issue with probably no easy solutions :)

-J
On 01/10/15 02:20, Juha Jeronen wrote:
Then again, the matter is further complicated by considering codes that run on a single machine, versus codes that run on a cluster. Threads being local to each node in a cluster,
You can run MPI programs on a single machine and you get OpenMP implementations for clusters. Just pick an API and stick with it. Sturla
On 01.10.2015 03:32, Sturla Molden wrote:
On 01/10/15 02:20, Juha Jeronen wrote:
Then again, the matter is further complicated by considering codes that run on a single machine, versus codes that run on a cluster. Threads being local to each node in a cluster,
You can run MPI programs on a single machine and you get OpenMP implementations for clusters. Just pick an API and stick with it.
Mm. I've quite often run MPI locally (it's nice for multicore scientific computing on Python), but I had no idea that OpenMP had cluster implementations. Thanks for the tip. I've got the impression that the way these APIs market themselves is that MPI is for processes, while OpenMP is for threads, even if this is not completely true across all implementations. (If I wanted maximal control over what each process/thread is doing, I'd go for ZeroMQ :) ) -J
Juha Jeronen <juha.jeronen@jyu.fi> wrote:
Mm. I've quite often run MPI locally (it's nice for multicore scientific computing on Python), but I had no idea that OpenMP had cluster implementations. Thanks for the tip.
Intel has been selling one, I think there are others too. OpenMP has a flush pragma for synchronizing shared variables. This means that OpenMP is not restricted to shared memory hardware. A "pragma omp flush" can just as well invoke some IPC mechanism, even network communication. Sturla
Sturla Molden <sturla.molden@gmail.com> wrote:
OpenMP has a flush pragma for synchronizing shared variables. This means that OpenMP is not restricted to shared memory hardware. A "pragma omp flush" can just as well invoke some IPC mechanism, even network communication.
By the way, while this is the case for C and Fortran, it is certainly not the case for Cython. In a Cython prange block, a shared variable is accessed by dereferencing its address. This requires shared memory. Pure OpenMP in C does not, because shared variables are not accessed through pointers, but are rather normal variables that are synchronized with a pragma. Cython actually requires that there is a shared address space, and it invokes something that strictly speaking has undefined behavior under the OpenMP standard. So thus, a prange block in Cython is expected to work correctly on a laptop with a multicore processor, but it is not expected to work correctly on a cluster. IIRC, Intel's cluster OpenMP is based on MPI, which means the compiler will internally translate code with OpenMP pragmas into equivalent code that calls MPI functions. A program written for OpenMP can then run on any cluster that provides an MPI implementation.
Sturla Molden <sturla.molden@gmail.com> wrote:
Cython actually requires that there is a shared address space, and it invokes something that strictly speaking has undefined behavior under the OpenMP standard. So thus, a prange block in Cython is expected to work correctly on a laptop with a multicore processor, but it is not expected to work correctly on a cluster.
OpenMP does not guarantee that dereferencing a pointer in a parallel block will dereference the same object across all processors. It only guarantees that the value of a shared object can be synchronized. Many who use OpenMP and think only in terms of threads do this incorrectly; Cython is actually among them. S.M.
On 30 September 2015 at 18:20, Nathaniel Smith <njs@pobox.com> wrote:
On Sep 30, 2015 2:28 AM, "Daπid" <davidmenhur@gmail.com> wrote:
[...]

Is there a nice way to ship both versions? After all, most implementations of BLAS and friends do spawn OpenMP threads, so I don't think it would be outrageous to take advantage of it in more places; provided there is a nice way to fall back to a serial version when it is not available.
This is incorrect -- the only common implementation of BLAS that uses *OpenMP* threads is OpenBLAS, and even then it's not the default -- it only happens if you run it in a special non-default configuration.
Right, sorry. I wanted to say they spawn parallel threads. What do you mean by a non-default configuration? Setting the OMP_NUM_THREADS?
The challenges to providing transparent multithreading in numpy generally are:
- gcc + OpenMP on linux still breaks multiprocessing. There's a patch to fix this but they still haven't applied it; alternatively there's a workaround you can use in multiprocessing (not using fork mode), but this requires every user update their code and the workaround has other limitations. We're unlikely to use OpenMP while this is the case.
Any idea when this is going to be released? As I understand it, OpenBLAS doesn't have this problem, am I right?
- parallel code in general is not very composable. If someone is calling a numpy operation from one thread, great, transparently using multiple threads internally is a win. If they're exploiting some higher-level structure in their problem to break it into pieces and process each in parallel, and then using numpy on each piece, then numpy spawning threads internally will probably destroy performance. And numpy is too low-level to know which case it's in. This problem exists to some extent already with multi-threaded BLAS, so people use various BLAS-specific knobs to manage it in ad hoc ways, but this doesn't scale.
(Ironically OpenMP is more composable then most approaches to threading, but only if everyone is using it and, as per above, not everyone is and we currently can't.)
That is what I meant by also providing a single-threaded version. The user can choose whether they want the parallel or the serial one, depending on the case.
On Wed, Sep 30, 2015 at 11:54 PM, Daπid <davidmenhur@gmail.com> wrote:
On 30 September 2015 at 18:20, Nathaniel Smith <njs@pobox.com> wrote:
On Sep 30, 2015 2:28 AM, "Daπid" <davidmenhur@gmail.com> wrote: [...]
Is there a nice way to ship both versions? After all, most implementations of BLAS and friends do spawn OpenMP threads, so I don't think it would be outrageous to take advantage of it in more places; provided there is a nice way to fall back to a serial version when it is not available.
This is incorrect -- the only common implementation of BLAS that uses *OpenMP* threads is OpenBLAS, and even then it's not the default -- it only happens if you run it in a special non-default configuration.
Right, sorry. I wanted to say they spawn parallel threads. What do you mean by a non-default configuration? Setting the OMP_NUM_THREADS?
I don't remember the details -- I think it might be a special setting you have to enable when you build OpenBLAS.
The challenges to providing transparent multithreading in numpy generally are:
- gcc + OpenMP on linux still breaks multiprocessing. There's a patch to fix this but they still haven't applied it; alternatively there's a workaround you can use in multiprocessing (not using fork mode), but this requires every user update their code and the workaround has other limitations. We're unlikely to use OpenMP while this is the case.
Any idea when this is going to be released?
Which? The gcc patch? I spent 2 full release cycles nagging them and they still can't be bothered to make a decision either way, so :-(. If anyone has some ideas for how to get traction in gcc-land then I'm happy to pass on details...
As I understand it, OpenBLAS doesn't have this problem, am I right?
Right, in the default configuration then OpenBLAS will use its own internal thread pool code, and that code has the fixes needed to work with fork-based multiprocessing. Of course if you configure OpenBLAS to use OpenMP instead of its internal thread code then this no longer applies... -n -- Nathaniel J. Smith -- http://vorpus.org
On 1 October 2015 at 09:05, Nathaniel Smith <njs@pobox.com> wrote:
- gcc + OpenMP on linux still breaks multiprocessing. There's a patch to fix this but they still haven't applied it; alternatively there's a workaround you can use in multiprocessing (not using fork mode), but this requires every user update their code and the workaround has other limitations. We're unlikely to use OpenMP while this is the case.
Any idea when is this going to be released?
Which? The gcc patch? I spent 2 full release cycles nagging them and they still can't be bothered to make a decision either way, so :-(. If anyone has some ideas for how to get traction in gcc-land then I'm happy to pass on details...
:( Have you tried asking Python-dev for help with this? Hopefully they would have some weight there.
On 02/10/15 13:05, Daπid wrote:
Have you tried asking Python-dev for help with this? Hopefully they would have some weight there.
It seems both the GCC devs and Apple (for GCD and Accelerate) have taken a similar stance on this. There is a tiny set of functions the POSIX standard demands should work on both sides of a fork without exec, but OpenMP, GCD, BLAS or LAPACK are not included. As long as there is no bug, it is hard to convince them to follow Intel and allow fork-based multiprocessing.

As it stands now, using Intel compilers and MKL is the only way to make this work, but Intel's development tools are not freeware.

Sturla
On Mon, Oct 5, 2015 at 2:52 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
On 02/10/15 13:05, Daπid wrote:
Have you tried asking Python-dev for help with this? Hopefully they would have some weight there.
It seems both the GCC devs and Apple (for GCD and Accelerate) have taken a similar stance on this. There is a tiny set of functions the POSIX standard demands should work on both sides of a fork without exec, but OpenMP, GCD, BLAS or LAPACK are not included. As long as there is no bug, it is hard to convince them to follow Intel and allow fork-based multiprocessing.
To be clear, the GCC devs are open to supporting fork+OpenMP in principle, they just aren't willing to do it in a way that risks breaking strict POSIX or OpenMP compatibility. But that isn't even the problem -- we have a patch that is strictly compatible with POSIX and OpenMP. The problem is that with the patch, the cases that would formerly have deadlocked instead leak some memory. This is not a big deal IMO for a variety of reasons (mostly that a one time leak per child process is tiny, esp. compared to the current situation with deadlocks), but it means the patch needs someone serious in the GCC community to take a look at it carefully, understand what the tradeoffs actually are, and make a judgement call. And so far we haven't convinced anyone to do this. -n -- Nathaniel J. Smith -- http://vorpus.org
On Mon, Oct 5, 2015 at 3:05 PM, Nathaniel Smith <njs@pobox.com> wrote:
On Mon, Oct 5, 2015 at 2:52 PM, Sturla Molden <sturla.molden@gmail.com> wrote:
On 02/10/15 13:05, Daπid wrote:
Have you tried asking Python-dev for help with this? Hopefully they would have some weight there.
It seems both the GCC devs and Apple (for GCD and Accelerate) have taken a similar stance on this. There is a tiny set of functions the POSIX standard demands should work on both sides of a fork without exec, but OpenMP, GCD, BLAS or LAPACK are not included. As long as there is no bug, it is hard to convince them to follow Intel and allow fork-based multiprocessing.
To be clear, the GCC devs are open to supporting fork+OpenMP in principle, they just aren't willing to do it in a way that risks breaking strict POSIX or OpenMP compatibility. But that isn't even the problem -- we have a patch that is strictly compatible with POSIX and OpenMP. The problem is that with the patch, the cases that would formerly have deadlocked instead leak some memory. This is not a big deal IMO for a variety of reasons (mostly that a one time leak per child process is tiny, esp. compared to the current situation with deadlocks), but it means the patch needs someone serious in the GCC community to take a look at it carefully, understand what the tradeoffs actually are, and make a judgement call. And so far we haven't convinced anyone to do this.
Since this discussion's come around again, I finally got curious enough to check the Intel OpenMP runtime's new(ish) open source releases, and it turns out that they leak memory in exactly the same way as the gcc patch :-). -- Nathaniel J. Smith -- http://vorpus.org
On 30 September 2015 at 18:20, Nathaniel Smith <njs@pobox.com> wrote:
- parallel code in general is not very composable. If someone is calling a numpy operation from one thread, great, transparently using multiple threads internally is a win. If they're exploiting some higher-level structure in their problem to break it into pieces and process each in parallel, and then using numpy on each piece, then numpy spawning threads internally will probably destroy performance. And numpy is too low-level to know which case it's in. This problem exists to some extent already with multi-threaded BLAS, so people use various BLAS-specific knobs to manage it in ad hoc ways, but this doesn't scale.
One idea: what about creating a "parallel numpy"? There are a few algorithms that can benefit from parallelisation. This library would mimic Numpy's signature, and the user would be responsible for choosing the single threaded or the parallel one by just changing np.function(x, y) to pnp.function(x, y).

If that were deemed a good one, what would be the best parallelisation scheme? OpenMP? Threads?
On Tue, Oct 6, 2015 at 1:14 AM, Daπid <davidmenhur@gmail.com> wrote:
One idea: what about creating a "parallel numpy"? There are a few algorithms that can benefit from parallelisation. This library would mimic Numpy's signature, and the user would be responsible for choosing the single threaded or the parallel one by just changing np.function(x, y) to pnp.function(x, y)
I would recommend taking a look at dask.array [1], which in many cases works exactly like a parallel NumPy, though it also does lazy and out-of-core computation. It's a new project, but it's remarkably mature -- we use it as an alternative array backend (to numpy) in xray, and it's also being used by scikit-image. [1] http://dask.pydata.org/en/latest/array.html
If that were deemed a good one, what would be the best parallelisation scheme? OpenMP? Threads?
Dask uses threads. That works pretty well as long as all the hard work is calling into something that releases the GIL (which includes NumPy, of course).
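A minimal dask.array usage sketch of the pattern Stephan describes (this assumes dask is installed; the array size and chunking are arbitrary choices for illustration):

---8<---8<---8<---
# same expression as plain NumPy, but evaluated chunk-by-chunk on a thread
# pool when .compute() is called
import numpy as np
import dask.array as da

x_np = np.random.rand(4000, 4000)
x = da.from_array(x_np, chunks=(1000, 1000))   # lazy, chunked view of the array
result = (x ** 2 + x).mean(axis=0).compute()   # threads execute the chunked graph
---8<---8<---8<---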
On 30/09/15 11:27, Daπid wrote:
Is there a nice way to ship both versions? After all, most implementations of BLAS and friends do spawn OpenMP threads, so I don't think it would be outrageous to take advantage of it in more places;
Some do, others don't.

ACML uses OpenMP. GotoBLAS uses OpenMP. Intel MKL uses Intel TBB and OpenMP (both of them). OpenBLAS will by default use an internal threadpool. It can be configured to use OpenMP instead. ATLAS uses its own threadpool. Apple Accelerate Framework uses a kernel thread-pool called the Grand Central Dispatch (GCD). A library called libdispatch uses kqueue to access the GCD.

There are two principal problems with using OpenMP in NumPy:

One is that the GNU OpenMP threadpool is not fork-safe, and can break multiprocessing on some platforms (e.g. when using Python 2.7 on Linux). Anything that uses GCD has this nasty effect on Apple and FreeBSD as well. Note that the problem is actually in multiprocessing. It is not present on Windows (there is no fork system call) and it is avoidable even on Linux with Python 3.4 or later. Also the default builds of NumPy and SciPy on MacOSX uses GCD by default.

The other problem is that NumPy with its iterator objects and gufuncs is not particularly fit for multithreading. There will be a lot of contention for the iterator object. Without a major redesign, one thread would do useful work while most of the others would be busy-waiting on a spinlock and fighting for iterator access. Not to mention that the iterator object would be false shared between the processors, which would trash the performance if you have more than one CPU, even when compared to using a single thread. This means that for multithreading to be useful in NumPy, the loops will have to be redesigned so that the work sharing between the threads is taken care of ahead of creating the iterator, i.e. allowing us to use one iterator per thread. Each iterator would then iterate over a slice of the original array. This is ok, but we cannot simply do this by adding OpenMP pragmas in the C code.

Given the two problems mentioned above, it would likely be better to use a fork-safe threadpool instead of OpenMP.

Sturla
On 01.10.2015 01:04, Sturla Molden wrote:
There are two principal problems with using OpenMP in NumPy:
One is that the GNU OpenMP threadpool is not fork-safe, and can break multiprocessing on some platforms (e.g. when using Python 2.7 on Linux). Anything that uses GCD has this nasty effect on Apple and FreeBSD as well. Note that the problem is actually in multiprocessing. It is not present on Windows (there is no fork system call) and it is avoidable even on Linux with Python 3.4 or later. Also the default builds of NumPy and SciPy on MacOSX uses GCD by default.
The other problem is that NumPy with its iterator objects and gufuncs is not particularly fit for multithreading. There will be a lot of contention for the iterator object. Without a major redesign, one thread would do useful work while most of the others would be busy-waiting on a spinlock and fighting for iterator access. Not to mention that the iterator object would be false shared between the processors, which would trash the performance if you have more than one CPU, even when compared to using a single thread. This means that for multithreading to be useful in NumPy, the loops will have to be redesigned so that the work sharing between the threads is taken care of ahead of creating the iterator, i.e. allowing us to use one iterator per thread. Each iterator would then iterate over a slice of the original array. This is ok, but we cannot simply do this by adding OpenMP pragmas in the C code.
Thanks for the details.
Given the two problems mentioned above, it would likely be better to use a fork-safe threadpool instead of OpenMP.
Sounds good. Out of curiosity, are there any standard fork-safe threadpools, or would this imply rolling our own?

Also, after Chris's, Nathaniel's and your replies, I'm no longer sure an attempt at transparent multithreading belongs in the polynomial solver. Although the additional performance would be (very) nice when running it from a single-threaded program - after all, quad-cores are common these days - it is not just this specific solver that encounters the issue, but instead the same considerations apply to basically all NumPy operations.

So maybe it would be better, at least at first, to make a pure-Cython version with no attempt at multithreading?

-J
On 01/10/15 02:32, Juha Jeronen wrote:
Sounds good. Out of curiosity, are there any standard fork-safe threadpools, or would this imply rolling our own?
You have to roll your own. Basically use pthread_atfork to register a callback that shuts down the threadpool before a fork and another callback that restarts it. Python's threading module does not expose the pthread_atfork function, so you must call it from Cython. I believe there is also a tiny atfork module in PyPI.
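A rough, untested Cython sketch of that registration; the handler bodies are placeholders for whatever threadpool implementation would actually be used.

---8<---8<---8<---
# sketch only: register fork handlers for a hypothetical thread pool from Cython
cdef extern from "pthread.h":
    int pthread_atfork(void (*prepare)(), void (*parent)(), void (*child)())

cdef void _before_fork():
    pass   # placeholder: stop the worker threads so no locks are held across fork()

cdef void _after_fork_parent():
    pass   # placeholder: restart the worker threads in the parent

cdef void _after_fork_child():
    pass   # placeholder: the child inherits no threads; recreate the pool lazily

def install_fork_handlers():
    # call once, e.g. at module import time
    if pthread_atfork(_before_fork, _after_fork_parent, _after_fork_child) != 0:
        raise RuntimeError("pthread_atfork() registration failed")
---8<---8<---8<---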
So maybe it would be better, at least at first, to make a pure-Cython version with no attempt at multithreading?
I would start by making a pure Cython version that works correctly. The next step would be to ensure that it releases the GIL. After that you can worry about parallel processing, or just tell the user to use threads or joblib. Sturla
On 01.10.2015 03:52, Sturla Molden wrote:
On 01/10/15 02:32, Juha Jeronen wrote:
Sounds good. Out of curiosity, are there any standard fork-safe threadpools, or would this imply rolling our own?
You have to roll your own.
Basically use pthread_atfork to register a callback that shuts down the threadpool before a fork and another callback that restarts it. Python's threading module does not expose the pthread_atfork function, so you must call it from Cython.
I believe there is also a tiny atfork module in PyPI.
Ok. Thanks. This approach fixes the issue of the threads not being there for the child process. I think it still leaves open the issue of creating the correct number of threads in the pools for each of the processes when the pool is restarted (so that in total there will be as many threads as cores (physical or virtual, whichever the user desires)). But this is again something that requires context...
So maybe it would be better, at least at first, to make a pure-Cython version with no attempt at multithreading?
I would start by making a pure Cython version that works correctly. The next step would be to ensure that it releases the GIL. After that you can worry about parallel processing, or just tell the user to use threads or joblib.
First version done and uploaded:

https://yousource.it.jyu.fi/jjrandom2/miniprojects/trees/master/misc/polysol...

OpenMP support removed; this version uses only Cython. The example program has been renamed to main.py, and setup.py has been cleaned, removing the irrelevant module. This folder contains only the files for the polynomial solver.

As I suspected, removing OpenMP support only required changing a few lines, and dropping the import for Cython.parallel. The "prange"s have been replaced with "with nogil" and "range". Note that both the original version and this version release the GIL when running the processing loops.

It may be better to leave this single-threaded for now. Using Python threads isn't that difficult and joblib sounds nice, too.

What's the next step?

-J
On 2 October 2015 at 11:58, Juha Jeronen <juha.jeronen@jyu.fi> wrote:
First version done and uploaded:
https://yousource.it.jyu.fi/jjrandom2/miniprojects/trees/master/misc/polysol...
Small comment: now you are checking if the input is a scalar or a ndarray, but it should also accept any array-like. If I pass a list, I expect it to work, internally converting it into an array.
On 02.10.2015 13:07, Daπid wrote:
On 2 October 2015 at 11:58, Juha Jeronen <juha.jeronen@jyu.fi <mailto:juha.jeronen@jyu.fi>> wrote:
First version done and uploaded:
https://yousource.it.jyu.fi/jjrandom2/miniprojects/trees/master/misc/polysol...
Small comment: now you are checking if the input is a scalar or a ndarray, but it should also accept any array-like. If I pass a list, I expect it to work, internally converting it into an array.
Good catch. Is there an official way to test for array-likes? Or should I always convert with asarray()? Or something else? -J
On Wed, Sep 30, 2015 at 3:20 AM, Juha Jeronen <juha.jeronen@jyu.fi> wrote:
Hi,
On 30.09.2015 03:48, Chris Barker - NOAA Federal wrote:
This sounds pretty cool -- and I've had a use case. So it would be nice to get into Numpy.
But: I doubt we'd want OpenMP dependence in Numpy itself.
It is indeed a good point not to add new dependencies for a small feature such as this. From a software engineering point of view, I fully agree.
Note however that with the current version of the code, not having OpenMP will severely limit the performance, especially on quad-core machines, since an important factor in the speed comes from the parallel processing of the independent problem instances.
But I suppose there may be another technical option to support multiple cores (doesn't NumPy already do this in some of its routines?). OpenMP was just the most convenient solution from the implementation viewpoint.
But maybe a pure Cython non-MP version?
That is of course possible. IIRC, changing the processing to occur on a single core should only require replacing Cython.parallel.prange() with range() in the array processing loops.
(Note range(), not xrange(), even for Python 2.x - Cython compiles a loop using range() into an efficient C loop if some simple compile-time conditions are fulfilled.)
Are we putting Cython in Numpy now? I lost track.
I thought so? But then again, I'm just a user :)
For what it's worth (no idea what the order of magnitude of the technical risk is for something like this in a library like numpy), it's actually quite simple to dynamically test for OpenMP support at install time. Here's what we do in yt:

https://bitbucket.org/yt_analysis/yt/src/f8934e13abaf349f58c52c076320d44ee4f61a7f/yt/utilities/lib/setup.py?at=yt&fileviewer=file-view-default#setup.py-8

Basically, just try to compile a simple OpenMP test program in a subprocess. If that succeeds, then great, we can add -fopenmp as a compilation flag. If not, don't do that.

This was implemented when Apple switched from GCC 4.2 to LLVM/Clang in OS X 10.7. We've had exactly one bug report about this, and it wasn't even related to OpenMP; instead it was an issue with the way we were using subprocess, which broke when someone had set "CC = ccache gcc".

Hope that's helpful,

Nathan Goldbaum
-J
------------------------------------------------- Juha Jeronen, Ph.D. juha.jeronen@jyu.fi University of Jyväskylä Department of Mathematical Information Technology -------------------------------------------------
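The install-time probe Nathan describes boils down to something like the following sketch. This is not the actual yt code; the compiler command and flag are assumptions, and as noted elsewhere in the thread they would need per-platform handling.

---8<---8<---8<---
# sketch of an install-time OpenMP probe: compile a trivial OpenMP program
# and see whether the compiler accepts it
import os
import subprocess
import tempfile

OMP_TEST_PROGRAM = r"""
#include <omp.h>
#include <stdio.h>
int main(void) {
    #pragma omp parallel
    printf("hello from thread %d\n", omp_get_thread_num());
    return 0;
}
"""

def have_openmp(compiler="cc", flag="-fopenmp"):
    """Return True if a trivial OpenMP program compiles and links."""
    tmpdir = tempfile.mkdtemp()
    src = os.path.join(tmpdir, "omp_probe.c")
    exe = os.path.join(tmpdir, "omp_probe")
    with open(src, "w") as f:
        f.write(OMP_TEST_PROGRAM)
    with open(os.devnull, "wb") as devnull:
        try:
            returncode = subprocess.call([compiler, flag, src, "-o", exe],
                                         stdout=devnull, stderr=devnull)
        except OSError:          # compiler not found at all
            return False
    return returncode == 0

# in setup.py one might then do something like:
# extra_compile_args = ["-fopenmp"] if have_openmp() else []
---8<---8<---8<---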
On Wed, Sep 30, 2015 at 6:57 AM, Nathan Goldbaum <nathan12343@gmail.com> wrote:
Note however that with the current version of the code, not having OpenMP
will severely limit the performance, especially on quad-core machines, since an important factor in the speed comes from the parallel processing of the independent problem instances.
It seems you got much more than 4 times performance improvement -- which is the most you'll get from four cores, presumably :-). not that another 4 times or so isn't a good thing.

But I suppose there may be another technical option to support multiple cores

python threads with nogil?
For what it's worth (no idea what the order of magnitude of the technical risk is for something like this in a library like numpy), it's actually quite simple to dynamically test for OpenMP support at install time.
Basically, just try to compile a simple OpenMP test program in a subprocess. If that succeeds, then great, we can add -fopenmp as a compilation flag. If not, don't do that.
This sounds like a compile-time test, not an install-time one. And install time isn't really helpful either; we really want plain old wheels, etc. to work. We'd need a run-time check.

-Chris

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On 30.09.2015 19:20, Chris Barker wrote:
On Wed, Sep 30, 2015 at 6:57 AM, Nathan Goldbaum <nathan12343@gmail.com <mailto:nathan12343@gmail.com>> wrote:
Note however that with the current version of the code, not having OpenMP will severely limit the performance, especially on quad-core machines, since an important factor in the speed comes from the parallel processing of the independent problem instances.
It seems you got much more than 4 times performance improvement
True, good point. That's the power of a short explicit algorithm for a specific problem :) (And maybe some Cython, too.)
-- which is the most you'll get from four cores, presumably :-). not that another 4 times or so isn't a good thing.
True in general :)

To be precise, in this case, HyperThreading seems to offer some extra performance. And there seems to be something wonky with the power management on my laptop. The number I posted (1.2e7 quartics per second) seemed slightly low compared to earlier informal benchmarking I had done, and indeed, benchmarking with 8 threads again I'm now getting 1.6e7 quartics per second (i7-4710MQ at 2.5 GHz, RAM at 1333 MHz, running on Linux Mint 17.2).

Here are some more complete benchmark results. Because the polynomials to be solved are randomly generated during each run of the program, unless otherwise noted, I've rerun the benchmark three times for each number of threads and picked the lowest-performance result, rounded down.

Number of threads vs. quartics solved per second:

1: 3.0e6
2: 6.1e6
4: 1.1e7
8: 1.6e7  (note: higher than previously!)

Threads vs. cubics:

1: 8.4e6
2: 1.6e7
4: 3.0e7
8: 4.6e7

Threads vs. quadratics:

1: 2.5e7
2: 4.8e7
4: 8.5e7  (typical across 9 benchmarks; lowest 5.7e7)
8: 9.5e7

From these, in general the weak scaling seems pretty good, and for quadratics and cubics, HyperThreading offers some extra performance, namely about 50% over the 4-thread result in both cases. The total speedup for quartics is 5.3x (3.6x without HT), and for cubics, 5.4x (3.5x without HT). "About 4x, as expected" is still a good characterization, though :)

For quadratics, 4 threads seems to be the useful maximum; improvement with HT is only 11%, with a total speedup of 3.8x (vs. 3.4x without HT). These require much fewer arithmetic operations than the higher-order cases, so a possible explanation is that this case is memory-bound. Running the numbers on a simplistic estimate, the total of input and output is far from hitting the DDR3 bandwidth limit, but of course that's ignoring (at least) latencies and the pattern of data accesses.

Testing the completely silly "linear" case that is there for documenting code structure only, I get:

1: 1.9e8
2: 3.1e8
4: 3.6e8
8: 2.9e8

which should give a reasonable first approximation for the maximum practical throughput on this machine, when performing almost no arithmetic.

Anyway, returning from the sidetrack...
But I suppose there may be another technical option to support multiple cores
python threads with nogil?
...maybe? I hadn't thought about this option.

I suppose that in pseudocode, the code structure would need to be something like this?

---8<---8<---8<---

from threading import Thread

cdef void fast_solver(...) nogil:
    # ...do stuff without touching Python objects...

def my_thread_entry(j_start, j_end):
    cdef unsigned int j
    with nogil:
        for j in range(j_start, j_end):
            fast_solver(...)   # process local slice

t1 = Thread( target=my_thread_entry, args=(0, j_split) )
t2 = Thread( target=my_thread_entry, args=(j_split, j_end_global) )
t1.start()
t2.start()
t1.join()
t2.join()

---8<---8<---8<---

-J
On 30/09/15 15:57, Nathan Goldbaum wrote:
Basically, just try to compile a simple OpenMP test program in a subprocess. If that succeeds, then great, we can add -fopenmp as a compilation flag. If not, don't do that.
Unless you use GCC on Linux, it will be more complex than that. I.e. do you need -fopenmp, -openmp or /openmp? And which libraries do you need to link, if any? Link GOMP? Link some other OpenMP runtime libraries? Link pthreads? There are three different pthread implementations to choose between on Windows, depending on GCC version (MinGW, MinGW-w64, and Cygwin need different pthread libs for OpenMP). It gets messy.

The only clean way is to add the correct flags and libraries into the compiler classes in numpy.distutils, and then let distutils and bento query those.

Sturla
Juha Jeronen <juha.jeronen <at> jyu.fi> writes:
I recently developed a Cython-based, OpenMP-accelerated quartic (and cubic, quadratic) polynomial solver to address a personal research need for quickly solving a very large number of independent low-degree polynomial equations for both real and complex coefficients.
My 2c in this context would be to also think how this best fits in how collections of polynomials are represented in Numpy.

AFAICS, Numpy's polynomial module supports evaluation of polynomial collections (cf. numpy.polynomial.polynomial.polyval), but the corresponding root finding routine (numpy.polynomial.polynomial.polyroots) only deals with one polynomial at a time.

The present code in principle could be used to speed up the latter routine after it is generalized to multiple polynomials. The general case is probably doable just by coding up the companion matrix approach using low-level routines (or maybe with the new vectorized np.linalg routines).

Some relevant code elsewhere for similar purposes can be found in scipy.interpolate.PPoly/BPoly (and in the future BSpline):

https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.PPoly...

However, since it's piecewise, there's purposefully support only for real-valued roots.

-- Pauli Virtanen
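As a rough illustration of the companion-matrix route with the vectorized np.linalg routines (NumPy >= 1.8), here is a sketch for a batch of cubics; the coefficient layout is an assumption for the example, not an existing NumPy API.

---8<---8<---8<---
# sketch: roots of many cubics at once via stacked companion matrices and
# the gufunc-based np.linalg.eigvals
import numpy as np

def batched_cubic_roots(coeffs):
    """coeffs: (n, 4) array, rows [a0, a1, a2, a3] of a0 + a1*x + a2*x**2 + a3*x**3,
    with a3 != 0. Returns an (n, 3) complex array of roots."""
    coeffs = np.asarray(coeffs, dtype=complex)
    monic = coeffs[:, :3] / coeffs[:, 3:4]    # c0, c1, c2 of the monic cubic
    n = coeffs.shape[0]
    comp = np.zeros((n, 3, 3), dtype=complex)
    comp[:, 1, 0] = 1.0                       # subdiagonal ones of the companion matrix
    comp[:, 2, 1] = 1.0
    comp[:, :, 2] = -monic                    # last column: -c0, -c1, -c2
    return np.linalg.eigvals(comp)            # eigvals accepts stacked matrices

# e.g. batched_cubic_roots([[-1, 0, 0, 1],      # x**3 - 1
#                           [-6, 11, -6, 1]])   # (x-1)(x-2)(x-3)
---8<---8<---8<---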
On 30.09.2015 13:21, Pauli Virtanen wrote:
Juha Jeronen <juha.jeronen <at> jyu.fi> writes:
I recently developed a Cython-based, OpenMP-accelerated quartic (and cubic, quadratic) polynomial solver to address a personal research need for quickly solving a very large number of independent low-degree polynomial equations for both real and complex coefficients. My 2c in this context would be to also think how this best fits in how collections of polynomials are represented in Numpy.
AFAICS, Numpy's polynomial module supports evaluation of polynomial collections (cf. numpy.polynomial.polynomial.polyval), but the corresponding root finding routine (numpy.polynomial.polynomial.polyroots) only deals with one polynomial at a time.
The present code in principle could be used to speed up the latter routine after it is generalized to multiple polynomials. The general case is probably doable just by coding up the companion matrix approach using low-level routines (or maybe with the new vectorized np.linalg routines).
Some relevant code elsewhere for similar purposes can be found in scipy.interpolate.PPoly/BPoly (and in the future BSpline).
https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.PPoly...
However, since it's piecewise, there's purposefully support only for real-valued roots.
Thanks. It sounds like numpy.polynomial.polynomial.polyroots() is a good candidate place for this, if as you suggest, it is first generalized to handle multiple polynomials. -J
participants (12)

- Charles R Harris
- Chris Barker
- Chris Barker - NOAA Federal
- Daπid
- Juha Jeronen
- Matthieu Brucher
- Nathan Goldbaum
- Nathaniel Smith
- Pauli Virtanen
- Robert Kern
- Stephan Hoyer
- Sturla Molden