
Hi Andrew (and others),

Thanks for the input. I can describe the experiment I’ve done so far that makes me believe this simpler finite differencing would be an improvement for the DFO case, but I’m not sure whether there is a standard benchmark that scipy uses for making these decisions.

My experiment is on the DFO benchmark of Moré and Wild (https://www.mcs.anl.gov/~more/dfo/), comparing 3 algorithms. The green curve is just calling the current implementation of fmin_bfgs. The orange curve is my own implementation of fmin_bfgs based on scipy, but using a simple forward finite differencing approximation of the full gradient. The blue curve is the same as the orange one, but replacing the forward finite differencing approximation of the full gradient with a forward finite differencing approximation of just the directional derivative (inside the Wolfe line search). The sampling distance for finite differencing is 1.4901161193847656e-08 (for both the orange and blue curves), which is the same default as scipy (https://github.com/scipy/scipy/blob/v1.7.0/scipy/optimize/optimize.py#L1448, and the value recommended in Nocedal and Wright). The blue curve is the algorithm I’m proposing; I included the orange one only to compare the simple vs. sophisticated full-gradient finite differencing methods (and any differences in the outer BFGS, which should be basically the same).

The plot is a performance profile: the y axis is the fraction of benchmark problems solved to a desired precision, and the x axis is the performance ratio, i.e. the ratio of the number of function evaluations used by each algorithm to the number used by the most efficient algorithm on that problem. For instance, in this experiment the current implementation is fastest on about 42% of the benchmark problems, and the proposed version is fastest on about 58%. The proposed implementation never uses more than 3x the fewest number of function evaluations to converge, whereas on about 9% of the benchmark problems the current implementation fails to converge even within 10x the fewest number of function evaluations.

As far as literature/theory goes, I’m not aware of this exact comparison (which is part of why I set up this experiment). My instinct is that the line search subroutine only really cares about the directional derivative along this specific search direction, so I would expect that estimating it directly might even have some benefits compared to taking samples in the standard basis directions and then taking an inner product. Certainly there is some sensitivity to the sampling radius, but this is true even when estimating the full gradient (I’m not sure I would expect it to be worse for just the directional derivative). If anyone does have specific references/experience, please chime in!

Is there a standard benchmark test suite that scipy uses to make these kinds of algorithm implementation decisions? (It would also make sense to me to keep both options around, have a default, and let users deviate from it if they want.)

Best,
Sara
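P.S. For anyone not familiar with performance profiles, here is a rough sketch of how such a curve is computed; the evaluation counts in the example are made up purely to illustrate the definition above, they are not from the benchmark.

import numpy as np

def performance_profile(n_evals, tau_grid):
    """Performance profile in the sense described above.

    n_evals  : (n_problems, n_solvers) array of function-evaluation counts,
               with np.inf where a solver failed to converge on a problem.
    tau_grid : performance ratios at which to evaluate the profile.

    Returns a (len(tau_grid), n_solvers) array: for each solver, the fraction
    of problems it solved within tau times the evaluations used by the most
    efficient solver on that problem.
    """
    n_evals = np.asarray(n_evals, dtype=float)
    best = n_evals.min(axis=1, keepdims=True)   # best solver on each problem
    ratios = n_evals / best                     # performance ratio per (problem, solver)
    return np.array([(ratios <= tau).mean(axis=0) for tau in tau_grid])

# Made-up counts for 3 solvers on 4 problems (np.inf = failed to converge):
counts = [[120, 100, 90],
          [300, 280, np.inf],
          [50, 55, 60],
          [400, 200, 210]]
print(performance_profile(counts, tau_grid=[1, 2, 4, 10]))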
On Jul 27, 2021, at 7:24 PM, Andrew Nelson <andyfaff@gmail.com> wrote:
On Wed, 28 Jul 2021 at 03:24, Sara Fridovich-Keil <sarafridov@gmail.com> wrote:

I have been using scipy.optimize.fmin_bfgs on some derivative-free benchmark problems, and noticed that whenever the Wolfe line search requires a directional derivative, the current implementation estimates the entire gradient via finite differencing, and then computes the directional derivative by taking the inner product of the gradient and the search direction. In my experiments, replacing this full gradient estimation with a single extra function evaluation to estimate the directional derivative directly is faster.
What I’d like to do is have an option for fprime to be either provided or not provided to the Wolfe line search. If the objective has a nice/cheap gradient then the current behavior is fine (passing the gradient function as fprime, and computing directional derivatives with an inner product), but if the objective is derivative-free then derphi should be computed with finite differencing along the search direction (just one extra function evaluation) instead of using fprime.
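For concreteness, the proposal amounts to something like the following sketch (the names f, xk, pk, alpha are illustrative rather than the actual scipy internals, and phi(alpha) = f(xk + alpha*pk) is assumed to be available already inside the line search, so only one extra function evaluation is needed):

import numpy as np

def derphi_forward_diff(f, xk, pk, alpha, phi_alpha, h=1.4901161193847656e-08):
    """Estimate phi'(alpha) = d/dalpha f(xk + alpha*pk) directly.

    phi_alpha = f(xk + alpha*pk) is typically already known inside the Wolfe
    line search, so this costs a single extra function evaluation, versus n
    extra evaluations for a forward-difference estimate of the full gradient
    followed by an inner product with pk.
    """
    return (f(xk + (alpha + h) * pk) - phi_alpha) / h

# Example on a toy quadratic:
f = lambda x: float(np.dot(x, x))
xk, pk, alpha = np.array([1.0, 2.0]), np.array([-1.0, -2.0]), 0.5
print(derphi_forward_diff(f, xk, pk, alpha, f(xk + alpha * pk)))  # ~ -5.0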
Estimating gradients with finite differences relies on a good choice of step size. Good step defaults are automatically chosen by `optimize._numdiff.approx_derivative` when fprime is estimated numerically. Estimating derphi with numerical differentiation along the search direction would first of all require a good step size along the search direction, `pk`. Whilst this may be OK if the parameter scaling, pk, and the derivatives are well chosen/well behaved (e.g. all the parameters are of the same magnitude, etc.), I'm concerned that there will be cases where it won't be as numerically accurate/stable as the existing behaviour. For example, a chosen step along pk may result in individual dx that aren't optimal from a numerical differentiation viewpoint. How would one know if a specific system was exhibiting that behaviour? I'd expect the current code to be more robust than your proposed alternative. Having said that, I'm not an expert in this domain, so I'd be interested to hear what someone more expert than me has to say. Can you point to any literature suggesting that your proposed changes are generally acceptable?
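For concreteness, a toy comparison of the two estimates being discussed (the objective, iterate, and search direction below are made up, just to show where the per-component steps come from in each case):

import numpy as np
from scipy.optimize._numdiff import approx_derivative

f = lambda x: x[0] ** 2 + (1e6 * x[1]) ** 2   # badly scaled toy objective
xk = np.array([1.0, 1e-6])                    # current iterate
pk = np.array([-1.0, -1e-6])                  # search direction

# Existing behaviour: approx_derivative chooses a step per component
# (relative to each x_i), then the directional derivative is an inner
# product of the estimated gradient with pk.
derphi_current = approx_derivative(f, xk, method='2-point') @ pk

# Proposed behaviour: one absolute step h along pk.  The induced
# per-component steps are h * pk_i, which need not match the steps
# approx_derivative would pick when parameter magnitudes differ a lot.
h = 1.4901161193847656e-08
derphi_proposed = (f(xk + h * pk) - f(xk)) / h

print(derphi_current, derphi_proposed)        # both should be close to -4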
Andrew.