directional derivatives in Wolfe line search

Hi SciPy-Dev,

I have been using scipy.optimize.fmin_bfgs on some derivative-free benchmark problems, and noticed that whenever the Wolfe line search requires a directional derivative, the current implementation estimates the entire gradient via finite differencing and then computes the directional derivative by taking the inner product of the gradient and the search direction. In my experiments, replacing this full gradient estimation with a single extra function evaluation to estimate the directional derivative directly is faster.

The key places this happens in the code are:

https://github.com/scipy/scipy/blob/701ffcc8a6f04509d115aac5e5681c538b5265a2...
https://github.com/scipy/scipy/blob/701ffcc8a6f04509d115aac5e5681c538b5265a2...

Both of the Wolfe line search implementations take an input function fprime that computes/estimates the full gradient, even though only derphi, the directional derivative, is used.

What I’d like to do is have an option for fprime to be either provided or not provided to the Wolfe line search. If the objective has a nice/cheap gradient then the current behavior is fine (passing the gradient function as fprime, and computing directional derivatives with an inner product), but if the objective is derivative-free then derphi should be computed with finite differencing along the search direction (just one extra function evaluation) instead of using fprime.

Do people agree this would be a good change? If so I can make a pull request.

Best,
Sara
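(For concreteness, here is a minimal sketch of the two ways of getting the directional derivative, assuming simple forward differences with step h; the function names are mine, not scipy's.)

```python
import numpy as np

H = 1.4901161193847656e-08  # sqrt(machine eps), a typical forward-difference step

def derphi_via_full_gradient(f, x, pk, h=H):
    # Current behavior: estimate the full gradient with n extra
    # function evaluations, then project it onto the search direction.
    f0 = f(x)
    grad = np.array([(f(x + h * e) - f0) / h for e in np.eye(x.size)])
    return grad @ pk

def derphi_directly(f, x, pk, h=H):
    # Proposed behavior: one extra function evaluation along pk.
    return (f(x + h * pk) - f(x)) / h
```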

On Wed, 28 Jul 2021 at 03:24, Sara Fridovich-Keil <sarafridov@gmail.com> wrote:
> [...] In my experiments, replacing this full gradient estimation with a single extra function evaluation to estimate the directional derivative directly is faster. [...] if the objective is derivative-free then derphi should be computed with finite differencing along the search direction (just one extra function evaluation) instead of using fprime.
Estimating gradients with finite differences relies on a good choice of step size. Good step defaults are automatically chosen by `optimize._numdiff.approx_derivative` when fprime is estimated numerically. Estimating derphi with numerical differentiation along the search direction would first of all require a good step size along the search direction, `pk`. Whilst this may be ok if the parameter scaling, pk, and the derivatives are well chosen/well behaved (e.g. all the parameters are the same magnitude, etc.), I'm concerned that there will be cases where it won't be as numerically accurate/stable as the existing behaviour. For example, a chosen step along pk may result in individual dx that aren't optimal from a numerical differentiation viewpoint. How would one know if a specific system was exhibiting that behaviour? I'd expect the current code to be more robust than your proposed alternative.

Having said that, I'm not an expert in this domain, so I'd be interested to hear what someone who is more expert than me has to say. Can you point to any literature that says that your proposed changes are generally acceptable?

Andrew.
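(A hedged illustration of the scaling worry, with made-up numbers: a single step size h along pk implies per-coordinate perturbations h * pk[i], which can be far from the per-coordinate steps approx_derivative would pick.)

```python
import numpy as np

pk = np.array([1e6, 1e-6])       # a badly scaled search direction
h = 1.4901161193847656e-08       # one absolute step size along pk
print(h * pk)                    # [1.49e-02 1.49e-14]: one coordinate is
                                 # perturbed hugely, the other below noise
```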

Hi Andrew (and others),

Thanks for the input. I can describe the experiment I’ve done so far that makes me believe this simpler finite differencing would be an improvement for the DFO case, but I’m not sure if there is a standard benchmark that is used for making these decisions in scipy.

My experiment is on the DFO benchmark of Moré and Wild (https://www.mcs.anl.gov/~more/dfo/), comparing 3 algorithms. The green curve is just calling the current implementation of fmin_bfgs. The orange curve is my own implementation of fmin_bfgs based on scipy but using a simple forward finite differencing approximation to the full gradient. The blue curve is the same as the orange one, but replacing the forward finite differencing approximation of the full gradient with a forward finite differencing approximation of just the directional derivative (inside the Wolfe line search). The sampling distance for finite differencing is 1.4901161193847656e-08 (for both the orange and blue curves), which is the same default as scipy (https://github.com/scipy/scipy/blob/v1.7.0/scipy/optimize/optimize.py#L1448, and the value recommended in Nocedal and Wright). The blue curve is the algorithm I’m proposing; I just included the orange one to compare the simple vs sophisticated full-gradient finite differencing methods (and any differences in the outer BFGS, which should be basically the same).

The plot is a performance profile, where the y axis is the fraction of benchmark problems solved to a desired precision and the x axis is the performance ratio: the ratio of the number of function evaluations used by each algorithm compared to the number of function evaluations used by the most efficient algorithm on each problem. For instance, in this experiment the current implementation is fastest on about 42% of the benchmark problems, and the proposed version is fastest on about 58% of the benchmark problems. The proposed implementation never uses more than 3x the fewest number of function evaluations to converge, whereas on about 9% of the benchmark problems the current implementation fails to converge within 10x the number of function evaluations.

As far as literature/theory, I’m not aware of this exact comparison (which is part of why I set up this experiment). My instinct is that the line search subroutine only really cares about the directional derivative in this specific search direction, so I would expect that estimating that directly might even have some benefits compared to taking samples in standard basis directions and then taking an inner product. Certainly there is some sensitivity to the sampling radius, but this is true even in the case of estimating the full gradient (I’m not sure I would expect it to be worse for just estimating the directional derivative). If anyone does have specific references/experience, please chime in!

Is there a standard benchmark test suite that scipy uses to make these kinds of algorithm implementation decisions? (It would also make sense to me to keep both options around, have a default, and let users deviate from it if they want.)

Best,
Sara
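(For readers unfamiliar with performance profiles, a hedged sketch of how curves like these are computed, in the style of Dolan and Moré; the array names and example counts are mine.)

```python
import numpy as np

def performance_profile(evals, taus):
    """evals: (n_problems, n_solvers) array of function-evaluation
    counts, with np.inf where a solver failed to converge.
    Returns rho of shape (len(taus), n_solvers): the fraction of
    problems each solver finishes within a factor tau of the best."""
    best = evals.min(axis=1, keepdims=True)   # fewest evaluations per problem
    ratios = evals / best                     # performance ratios r_{p,s}
    return np.array([(ratios <= tau).mean(axis=0) for tau in taus])

# Tiny usage example with made-up counts for three solvers:
evals = np.array([[100., 120., 90.],
                  [200., np.inf, 210.]])
print(performance_profile(evals, taus=[1.0, 2.0]))
```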

On Wed, 28 Jul 2021 at 15:35, Sara Fridovich-Keil <sarafridov@gmail.com> wrote:
> Thanks for the input. I can describe the experiment I’ve done so far that makes me believe this simpler finite differencing would be an improvement for the DFO case, but I’m not sure if there is a standard benchmark that is used for making these decisions in scipy.

The first hurdle is to ensure that correctness isn't affected. The second hurdle is to show that the changes improve things. I'm not sure that we have a comprehensive test suite for these kinds of comparisons for scipy. There are functions in benchmarks/test_functions.py, but we don't have something more comprehensive like CUTEST. It would be good to have that!

> My experiment is on the DFO benchmark of Moré and Wild (https://www.mcs.anl.gov/~more/dfo/), comparing 3 algorithms. The green curve is just calling the current implementation of fmin_bfgs. The orange curve is my own implementation of fmin_bfgs based on scipy but using a simple forward finite differencing approximation to the full gradient.

If I understand correctly, according to that graph your implementation in orange is better than the existing scipy implementation?

> The blue curve is the same as the orange one, but replacing the forward finite differencing approximation of the full gradient with a forward finite differencing approximation of just the directional derivative (inside the Wolfe line search). The sampling distance for finite differencing is 1.4901161193847656e-08 (for both the orange and blue curves), which is the same default as scipy (https://github.com/scipy/scipy/blob/v1.7.0/scipy/optimize/optimize.py#L1448, and the value recommended in Nocedal and Wright). [...]

Note that the finite differencing for most of the optimizers is done by approx_derivative (https://github.com/scipy/scipy/blob/master/scipy/optimize/_numdiff.py#L257), which chooses the step size automatically. If `jac=None` for `_minimize_bfgs` then the default is absolute steps, but you can use relative steps by specifying `jac='2-point'`, etc.

I'm guessing that the modification would be:

1. If the user provides a callable jac, then use that in derphi.
2. If jac is estimated via a FD method, then estimate derphi by finite difference along the search direction, as you suggest. Estimate grad by FD of `fun`.

Both grad and derphi are calculated in line_search_wolfe1, line_search_wolfe2.
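(A hedged sketch of that dispatch, with hypothetical names; the real line_search_wolfe1/line_search_wolfe2 internals are organized differently.)

```python
import numpy as np

H = 1.4901161193847656e-08  # forward-difference step, for illustration

def make_phi_and_derphi(fun, jac, xk, pk):
    # phi(alpha) = fun(xk + alpha * pk), the 1-D restriction that the
    # Wolfe conditions are checked against.
    def phi(alpha):
        return fun(xk + alpha * pk)

    if callable(jac):
        # Case 1: analytic/cheap gradient -- project it onto pk.
        def derphi(alpha):
            return np.dot(jac(xk + alpha * pk), pk)
    else:
        # Case 2: derivative-free -- one extra evaluation along pk.
        def derphi(alpha):
            return (phi(alpha + H) - phi(alpha)) / H

    return phi, derphi
```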

> The first hurdle is to ensure that correctness isn't affected. The second hurdle is to show that the changes improve things. I'm not sure that we have a comprehensive test suite for these kinds of comparisons for scipy. There are functions in benchmarks/test_functions.py, but we don't have something more comprehensive like CUTEST. It would be good to have that!

I ended up translating the Moré and Wild CUTEST benchmark into python (from matlab) for my own experiment anyway, so I’d be happy to include that as a benchmark for scipy.

> If I understand correctly, according to that graph your implementation in orange is better than the existing scipy implementation?

It’s a bit nuanced; it looks like the current scipy implementation is faster than my orange implementation for some problems, but there are other problems where my orange implementation converges and the current scipy gets stuck. I suspect the former is probably because of the fancier finite differencing in scipy, and the latter might be because I added a little check in the BFGS update (since the benchmark problems are nonconvex): if BFGS ever tries to take an ascent direction, I replace it with steepest descent for that step, and reset the BFGS matrix to identity.
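(A hedged sketch of that safeguard, with variable names of my choosing; it would sit inside the BFGS loop, just before the line search.)

```python
import numpy as np

def safeguarded_direction(Hk, gk):
    # Hk: current inverse-Hessian approximation; gk: current gradient.
    pk = -Hk @ gk
    if np.dot(gk, pk) >= 0:        # BFGS proposed an ascent direction
        Hk = np.eye(gk.size)       # reset the BFGS matrix to identity
        pk = -gk                   # fall back to steepest descent
    return Hk, pk
```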
> I'm guessing that the modification would be:
>
> 1. If the user provides a callable jac, then use that in derphi.
> 2. If jac is estimated via a FD method, then estimate derphi by finite difference along the search direction, as you suggest. Estimate grad by FD of `fun`.
>
> Both grad and derphi are calculated in line_search_wolfe1, line_search_wolfe2.

Yes, I think this would be good.

Best,
Sara

I just sent a PR for this, looking forward to feedback :)

Best,
Sara

Hi all,

Is anyone available to wrap up reviewing my PR (https://github.com/scipy/scipy/pull/14579)? I think I’ve now resolved everything from our earlier discussion.

Thanks!
Sara

Hi all,

Just bumping my earlier message. I think everything should be resolved now and GitHub says all checks are passing. If anyone has a moment to take a look and merge this in (or give more feedback), that would be great!

Thanks,
Sara

On Fri, Oct 8, 2021 at 9:28 PM Sara Fridovich-Keil <sarafridov@gmail.com> wrote:
> Just bumping my earlier message. I think everything should be resolved now and GitHub says all checks are passing. If anyone has a moment to take a look and merge this in (or give more feedback), that would be great!

Thanks for your patience, Sara! I'll try to look at it this weekend. Our PR backlog has grown to scary levels, unfortunately...

Cheers,
Ralf

Hi Ralf,

No worries at all! I’m not in a hurry; I just wanted to make sure it didn’t get lost.

Thanks,
Sara
