Dear all,

Following these posts:
http://stackoverflow.com/questions/10489134/multithreaded-calls-to-the-objec...
it seems it is possible to make leastsq take advantage of multiple processors.

I was wondering: given that the tendency of processors is to have more and more cores nowadays, why is this not done by default in leastsq?

Best regards,

Frédéric Parrenin
On Thu, Dec 19, 2013 at 4:55 AM, Frédéric Parrenin <parrenin@ujf-grenoble.fr> wrote:
I think parallelizing leastsq would almost always be the wrong place to parallelize. Even the loop over j that Pauli mentions is in the user function, and leastsq cannot assume that this works, since there are many applications where the calculations for different j's are not independent of each other.

Using parallelization in the wrong spot can hurt performance instead of improving it:
https://groups.google.com/d/msg/pystatsmodels/3X1LlY9U3Yc/7FDXWEADBUIJ

Josef
On Thu, Dec 19, 2013 at 6:24 AM, <josef.pktd@gmail.com> wrote:
I think parallelizing leastsq would almost always be the wrong place to parallelize.
I am slightly reluctant to speak up, but I think this may not be true. For calls to leastsq() with finite-difference Jacobians, MINPACK's lmdif() calls fdjac2() in each iteration. This subroutine then calls the user's objective function N times (for N variables) in a simple loop, with slightly different values for the variables. Although these calls share a work array, that is an implementation detail, and the array elements per variable are actually independent.

This loop of N calls to the objective function per iteration would be a good candidate for a multiprocessing pool, and doing so could give a substantial speed-up for problems with more than a couple of variables where the calculation of the objective function is the bottleneck (which is typical for all but simple examples).

Currently, scipy's leastsq() simply calls the Fortran lmdif() (for finite-difference Jacobians). I think replacing fdjac2() with a multiprocessing version would require reimplementing both lmdif() and fdjac2(), probably using Cython. If the calls to MINPACK's lmpar() and qrfac() could be left untouched, this translation does not look too insane -- the two routines lmdif() and fdjac2() themselves are not that complicated.

It would be a fair amount of work, and I cannot volunteer to do this myself any time soon. But I do think it actually would improve the speed of leastsq() for many use cases. Hoping this will inspire someone...
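In rough Python terms, the loop in fdjac2() looks something like the following. This is a sketch with illustrative names (func, x0, f0, eps), not MINPACK's actual code:

import numpy as np

def fdjac2_loop(func, x0, f0, eps):
    """Forward-difference Jacobian, one column per variable."""
    n = len(x0)
    jac = np.empty((len(f0), n))
    for j in range(n):          # N independent calls: the parallel target
        x = np.array(x0, dtype=float)
        h = eps * abs(x[j])
        if h == 0.0:
            h = eps             # fall back to the absolute step
        x[j] += h
        jac[:, j] = (func(x) - f0) / h
    return jac

--Matt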
Matt Newville wrote:
It would be a fair amount of work, and I cannot volunteer to do this myself any time soon. But I do think it actually would improve the speed of leastsq() for many use cases.
Computing the Jacobian using multiprocessing definitely helps the speed. I wrote the unrated answer (xioxox) there, which shows how to do it in Python.

Jeremy
Jeremy,

On Fri, Dec 20, 2013 at 6:43 AM, Jeremy Sanders <jeremy@jeremysanders.net> wrote:
Sorry, I hadn't read the stackoverflow discussion carefully enough. You're right that this is the same basic approach, and your suggestion is much easier to implement. I think having helper functions to automatically provide this functionality would be really great.
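One possible shape for such a helper -- not an existing scipy API, just a sketch assuming a picklable objective function of a single argument:

import numpy as np

def make_parallel_dfun(func, pool, eps=None):
    """Wrap func and a multiprocessing.Pool into a Dfun for leastsq.

    The returned function computes a forward-difference Jacobian with one
    pool task per variable. Row j holds the derivatives with respect to
    variable j, so pass col_deriv=1 to leastsq.
    """
    if eps is None:
        eps = np.sqrt(np.finfo(float).eps)

    def dfun(x):
        f0 = np.asarray(func(x))
        perturbed = []
        for j in range(len(x)):
            xj = np.array(x, dtype=float)
            xj[j] += eps
            perturbed.append(xj)
        # The perturbed evaluations are mutually independent.
        results = pool.map(func, perturbed)
        return np.array([(np.asarray(r) - f0) / eps for r in results])

    return dfun

Usage would then be roughly: pool = multiprocessing.Pool(); leastsq(func, x0, Dfun=make_parallel_dfun(func, pool), col_deriv=1).

--Matt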
On Fri, Dec 20, 2013 at 7:09 AM, Matt Newville <newville@cars.uchicago.edu> wrote:
I've implemented such an ability to use a multiprocessing Pool for leastsq() in a way that I think is suitable for scipy. This is currently at
https://github.com/newville/scipy/commit/3d0ac1da3bcd1d34a1bec8226ea0284f04f...

It adds an "mp_pool" argument to leastsq() which, if not None and if Dfun is not otherwise defined, will provide a Dfun that uses the provided multiprocessing Pool. It requires the user to make and manage the multiprocessing Pool rather than trying to manage it inside leastsq(). I think this is ready for a PR, but would happily take comments on it.

I do notice that for small test programs, such as the test_minpack.py suite, this approach (which is basically a generalization of Jeremy's implementation) is significantly slower (~10x) than not using multiprocessing. This slow-down seems entirely due to replacing the Fortran subroutine fdjac2 with a Python Dfun. I imagine that real performance will be highly variable, and I have not explored (yet?) whether using Cython might help here. I expect it would, but I have no experience using Cython and multiprocessing together.

One caveat of the multiprocessing approach is that the objective function must be pickleable, which can be challenging in many real-world situations, say where the objective function is an instance method. Solutions using copy_reg() are reported to work, but I couldn't get this to work for the test_minpack.py suite, so those tests only use a plain function for the objective function.

Is this addition worth including in leastsq()? I would think that it does little harm, might be useful for some, and provides a starting point for further work.
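For the record, intended usage looks roughly like the sketch below. The mp_pool argument exists only in the branch above, not in any released scipy, and the model function here is made up for illustration:

import multiprocessing
import numpy as np
from scipy.optimize import leastsq

def residual(p, x, y):
    # A made-up model; module-level so multiprocessing can pickle it.
    return y - p[0] * np.exp(-x / p[1])

if __name__ == '__main__':
    x = np.linspace(0, 10, 501)
    y = 3.0 * np.exp(-x / 2.0) + 0.05 * np.random.randn(x.size)
    pool = multiprocessing.Pool(4)   # the caller creates and manages the pool
    pbest, ier = leastsq(residual, [1.0, 1.0], args=(x, y), mp_pool=pool)
    pool.close()
    pool.join()

Cheers,
--Matt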
Matt Newville <newville@cars.uchicago.edu> wrote:
Is this addition worth including in leastsq()? I would think that it does little harm, might be useful for some, and provides a starting point for further work.
I believe the "right place" to start vectorizing leastsq would be to use LAPACK for the QR factorization, and then leave the parallel computing to MKL or OpenBLAS. But if you do, it would be just as easy to write a Levenberg-Marquardt method from scratch rather than to patch MINPACK.

Sturla
Dear all,

Coming back to an old thread...

I tried Jeremy's method since it is the easiest to implement. Below is the Dfun function I provided to leastsq. In my experiment, I used a pool of 6 since I have 8 cores in my PC.

However, the computer becomes extremely slow, almost unusable, during the experiment. Do you know why this happens?

Best regards,

Frédéric

________________

import math as m
import multiprocessing
import numpy as np

nb_nodes = 6  # pool size; my PC has 8 cores

# Create the pool once and reuse it across calls to Dres, so that
# leastsq does not fork nb_nodes new processes on every iteration.
pool = multiprocessing.Pool(nb_nodes)

def Dres(var):
    """Calculate derivatives for each parameter using the pool."""
    zeropred = residuals(var)  # residuals() is the objective function, defined elsewhere
    delta = m.sqrt(np.finfo(float).eps)  # stolen from the leastsq code
    derivparams = []
    for i in range(len(var)):
        copy = np.array(var)
        copy[i] += delta
        derivparams.append(copy)
    # One independent residuals() call per perturbed parameter vector.
    results = pool.map(residuals, derivparams)
    derivs = [(r - zeropred) / delta for r in results]
    return derivs
Hi Frederic,

On Thu, May 22, 2014 at 10:20 AM, Frédéric Parrenin <parrenin@ujf-grenoble.fr> wrote:
However, the computer becomes extremely slow, almost unusable, during the experiment. Do you know why this happens?
Yes, my observation, based on the code at
https://github.com/newville/scipy/commit/3d0ac1da3bcd1d34a1bec8226ea0284f04f...
was that there was about a 10x performance hit. So, similar to your observations.

This approach assumes that the cost of setting up multiple processes is small compared to the execution time of the objective function itself. It also assumes that having a Jacobian function in Python (as compared to Fortran) is a small performance hit. Again, this is more likely to be true for a time-consuming objective function, and almost certainly not true for any small test case.

I could be persuaded that this approach is still a reasonable idea, but (at least if implemented in pure Python) all the evidence is that it is much slower. Using Cython may help, but I have not tried this.

Any multiprocessing approach that includes calling the objective function from different processes is going to be limited by the "picklability" issue. To me, this is a fairly significant limitation. I've been led to believe that the Mystic framework may have worked around this problem, but I don't know the details.

Others have suggested that doing the QR factorization with multiprocessing would be the better approach. This seems worth trying, but, in my experience, the bulk of the time is actually spent in the objective function.

--Matt Newville
Answering my own question: the same code, run on Debian 7 instead of Ubuntu 13.10, does not slow down my computer. So this may be an Ubuntu-specific problem.

As for the gain, my program runs in 545 s on one core and in 123 s using 10 cores, a speed-up of about 4.4x. So it seems there is roughly a factor of 2 performance hit relative to ideal scaling in this case, which is not too bad.

Best regards,

Frédéric Parrenin
Actually, the parallel leastsq code is very unstable on both Debian 7 and Ubuntu 13.10. Sometimes it works, sometimes it freezes my computer.

I would be glad if anybody could explain the origin of this problem to me.

Best regards,

Frédéric Parrenin
I think this is fundamentally the wrong approach to a parallel leastsq. We should replace the MINPACK-supplied QR solver with one based on LAPACK. Then MKL, Accelerate or OpenBLAS will take care of the parallel processing. This is often the dominating part of the computation, so just doing parallel processing in the Python callbacks will not be very scalable. If you really care about a parallel leastsq, this is where you should put your effort. The computational complexity here is O(N**3), compared to O(N) for the callbacks. The bigger the problem, the more the QR part will dominate.

As for the callback functions that produce the residuals and Jacobian, the easiest solution would be a prange in Cython or Numba, or to use Python threads and release the GIL. I would not use multiprocessing without shared memory here, because otherwise the IPC overhead will be too big: the functions that compute the residuals and Jacobian are called repeatedly. The major IPC overhead is multiprocessing's internal use of pickle to serialize the ndarrays, not the communication over the pipes. I would instead just copy data to and from shared memory. You can find a shared memory system that works with multiprocessing at
https://github.com/sturlamolden/sharedmem-numpy

Note that it does not remove the pickle overhead, so you should reuse the shared memory arrays in the Python callbacks. This way the IPC overhead will be reduced to a memcpy.
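To make the shared-memory idea concrete without guessing at the sharedmem-numpy API, here is a minimal sketch using only the standard library (a fork-based start method is assumed, as on Linux):

import multiprocessing
import numpy as np

N = 8
# Raw shared buffer; lock-free is safe because workers write disjoint slots.
buf = multiprocessing.Array('d', N, lock=False)

def worker(i):
    # Re-view the inherited buffer as an ndarray: the array data never
    # gets pickled, only the integer i crosses the pipe.
    arr = np.frombuffer(buf)
    arr[i] = 2.0 * i

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    pool.map(worker, range(N))
    pool.close()
    pool.join()
    print(np.frombuffer(buf))   # values written by the child processes

Sturla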
Sturla Molden <sturla.molden@gmail.com> wrote:
The computational complexity here is O(N**3), compared to O(N) for the callbacks. The bigger the problem, the more the QR part will dominate.
In other words, as the size of the problem increases, the speed improvement from using multiprocessing in the callbacks will go towards zero. You will have no benefit from using multiprocessing with leastsq when you really need it. So put the parallel computing in leastsq in the right place.

In the long run, it would be better to replace the Fortran MINPACK leastsq with one that uses the LAPACK *GELS driver for each linearized least-squares fit. (BTW: using *GELSS (scipy.linalg.lstsq) kind of defies the purpose of using a regularization in the linearized least-squares fits, as the SVD solves this problem automatically.)
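As a sketch, one damped linearized step of this kind is a single LAPACK least-squares solve. The names J, r and lam (Jacobian, residual vector, damping parameter) are illustrative; note that scipy.linalg.lstsq dispatches to the SVD-based *GELSS underneath, which is exactly the point above:

import numpy as np
from scipy.linalg import lstsq   # LAPACK-backed least-squares driver

def lm_step(J, r, lam):
    """One damped linearized step: minimize ||J*dx + r||^2 + lam*||dx||^2.

    The augmented system below has the normal equations
    (J.T J + lam*I) dx = -J.T r, i.e. the classic Levenberg-Marquardt
    update, so the damping is explicit and a plain least-squares
    driver suffices. A threaded BLAS/LAPACK parallelizes this step.
    """
    n = J.shape[1]
    A = np.vstack([J, np.sqrt(lam) * np.eye(n)])
    b = np.concatenate([-r, np.zeros(n)])
    return lstsq(A, b)[0]

Sturla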
Hi Sturla,

On Thu, May 29, 2014 at 11:17 AM, Sturla Molden <sturla.molden@gmail.com> wrote:
I think this is fundamentally the wrong approach to a parallel leastsq. We should replace the MINPACK supplied QR-solver with one based on LAPACK. Then MKL, Accelerate or OpenBLAS will take care of the parallel processing.
I agree that, at some order of magnitude, QR will dominate. I'm not sure that this is "often the dominating part". Do you have any data on this? Certainly the crossover will depend on the complexity of the objective function, no?
As for the callback functions that produce residuals and Jacobian, the easiest solution would be a prange in Cython or Numba, or use Python threads and release the GIL. I would not use multiprocessing without shared memory here, because otherwise the IPC overhead will be too big. [...] Note that it does not remove the pickle overhead, so you should reuse the shared memory arrays in the Python callbacks. This way the IPC overhead will be reduced to a memcpy.
That seems reasonable for static ndarray data, but I think the bigger overhead may be the function calls.

For me, the bigger issue is requiring pickleable objects, especially instance methods. That is, I'd like to create a Data or Model object and pass that into a fitting function, using its methods as part of the residual calculation. Those instances would need to be pickleable to work with multiprocessing, and I don't know how to do that in a general way.
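The copy_reg() trick mentioned earlier amounts to teaching pickle to rebuild a bound method by name. A sketch in Python 2 spelling, as in this thread (Python 3 renames the module to copyreg and can already pickle bound methods of picklable instances):

import copy_reg
import types

def _pickle_method(method):
    # Rebuild the bound method as getattr(instance, name) on unpickling;
    # the instance itself must still be picklable for this to work.
    return getattr, (method.im_self, method.im_func.__name__)

copy_reg.pickle(types.MethodType, _pickle_method)

--Matt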
Sturla Molden wrote:
This is often the dominating part of the computation. So just parallel processing in the Python callbacks will not be very scalable. [...] The bigger the problem, the more the QR part will dominate.
This is certainly not true for many of the problems I've been recently solving. Calculating the objective function is very slow, and my models only have a few tens of parameters. The IPC is minimal here, so I get pretty good utilisation of several cores. I think support for both parallel processing optimisations is necessary.

Jeremy
On 12/19/2013 10:55 AM, Frédéric Parrenin wrote:
Folks, whatever the outcome of the discussion, please always make multithreading/-processing a configurable option. I want to be able to turn it off and have numpy/scipy use only 1 thread/core.

Most of my programs already use multicore processing, in which case having each process do 'second level' multicore stuff internally would be very counterproductive. For example, I have 48 cores, and thus 48 subprocesses of my program spawned doing calculations. When each of those also tries to spawn 48 threads/processes to optimize some leastsq problem, in the worst case I'll have 2304 (48*48) threads fighting for 48 CPUs...

E.g. I also always have the number of threads for OpenBLAS set to 1, as in the sketch below.
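Pinning the 'second level' to one thread is done with environment variables, set before numpy is imported (the variable names depend on which BLAS your numpy/scipy build uses; these two cover OpenBLAS and MKL):

import os

# Must be set before numpy is imported, or the BLAS thread pool may
# already have been initialized with the default thread count.
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['MKL_NUM_THREADS'] = '1'

import numpy as np

Best,
Vincent.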
participants (6): Frédéric Parrenin, Jeremy Sanders, josef.pktd@gmail.com, Matt Newville, Sturla Molden, Vincent Schut