Hi Andrew (and others),

Thanks for the input. I can describe the experiment I’ve run so far, which is what makes me believe this simpler finite differencing would be an improvement for the derivative-free (DFO) case, though I’m not sure whether there is a standard benchmark SciPy uses for making these decisions.

My experiment is on the DFO benchmark set of Moré and Wild (https://www.mcs.anl.gov/~more/dfo/), comparing three algorithms:

- The green curve is the current implementation of fmin_bfgs, called as-is.
- The orange curve is my own implementation of fmin_bfgs, based on scipy's, but using a simple forward finite-difference approximation of the full gradient.
- The blue curve is the same as the orange one, except the forward finite-difference approximation of the full gradient is replaced by a forward finite-difference approximation of just the directional derivative, inside the Wolfe line search.

The finite-difference sampling distance is 1.4901161193847656e-08 (the square root of machine epsilon for float64) for both the orange and blue curves, which is the same default as scipy (https://github.com/scipy/scipy/blob/v1.7.0/scipy/optimize/optimize.py#L1448) and the value recommended in Nocedal and Wright. The blue curve is the algorithm I’m proposing; I only included the orange one to compare simple vs. sophisticated full-gradient finite differencing (and to capture any differences in the outer BFGS, which should be basically the same).
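For concreteness, here is a rough sketch of the two approximations I mean (illustrative Python only, not the actual patch; the function names and the handling of the step along p are mine):

    import numpy as np

    h = 1.4901161193847656e-08  # sqrt(machine epsilon), same as scipy's default

    def fd_full_gradient(f, x, fx):
        # Forward-difference estimate of the full gradient:
        # n extra function evaluations (fx = f(x) is reused).
        g = np.empty_like(x, dtype=float)
        for i in range(x.size):
            step = np.zeros_like(x, dtype=float)
            step[i] = h
            g[i] = (f(x + step) - fx) / h
        return g

    def fd_directional_derivative(f, x, fx, p):
        # Forward-difference estimate of the directional derivative along
        # the search direction p: a single extra function evaluation.
        # (In practice one may want to scale the step by 1/||p||.)
        return (f(x + h * p) - fx) / h

Inside the Wolfe line search, the proposal is essentially to use fd_directional_derivative(f, x, fx, p) in place of np.dot(fd_full_gradient(f, x, fx), p).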

The plot is a performance profile: the y axis is the fraction of benchmark problems solved to a desired precision, and the x axis is the performance ratio, i.e. the number of function evaluations an algorithm uses on a problem divided by the number used by the most efficient algorithm on that problem. For instance, in this experiment the current implementation is fastest on about 42% of the benchmark problems and the proposed version is fastest on about 58%. The proposed implementation never needs more than 3x the fewest function evaluations to converge, whereas on about 9% of the problems the current implementation fails to converge even within 10x the fewest.
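In case it helps interpret the curves, the ratios are computed roughly as follows (a sketch, not my exact plotting script; failures to converge are recorded as np.inf):

    import numpy as np

    def performance_profile(n_evals, taus):
        # n_evals: (n_problems, n_solvers) array of function-evaluation
        # counts, with np.inf where a solver failed to converge.
        # Returns, for each solver, the fraction of problems solved within
        # a factor tau of the best solver, for each tau in taus.
        best = n_evals.min(axis=1, keepdims=True)
        ratios = n_evals / best
        return np.array([(ratios <= tau).mean(axis=0) for tau in taus]).T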

As far as literature/theory goes, I’m not aware of this exact comparison (which is part of why I set up this experiment). My instinct is that the line search subroutine only really cares about the directional derivative along the current search direction, so estimating that directly might even have some advantages over sampling along the standard basis directions and then taking an inner product with the direction. There is certainly some sensitivity to the sampling radius, but that is true for estimating the full gradient as well (I’m not sure I would expect it to be worse for just the directional derivative). If anyone does have specific references/experience, please chime in!
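To spell out the comparison (my notation: h is the sampling distance, e_i the standard basis vectors, p the search direction), the quantity the Wolfe conditions actually need is \phi'(\alpha) = \nabla f(x + \alpha p)^\top p, and at \alpha = 0 the two estimates are

    \nabla f(x)^\top p \approx \frac{f(x + h\,p) - f(x)}{h}
    \qquad \text{vs.} \qquad
    \nabla f(x)^\top p \approx \sum_{i=1}^{n} p_i \, \frac{f(x + h\,e_i) - f(x)}{h}

Both are first-order (O(h)) forward differences of the same quantity, which is why I don’t see an a priori reason to expect the single-sample version to be less accurate.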

Is there a standard benchmark test suite that scipy uses to make these kinds of algorithm implementation decisions? (It would also make sense to me to keep both options around, have a default, and let users deviate from it if they want.)
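For the “keep both options” idea, one possible shape would be an extra solver option, e.g. (purely hypothetical; ‘finite_diff_derivative’ is not an existing scipy option, and current scipy would just warn that it is unknown):

    import numpy as np
    from scipy.optimize import minimize, rosen

    x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

    # Hypothetical option name, shown only to illustrate the interface;
    # the default value would preserve current behavior.
    res = minimize(rosen, x0, method='BFGS',
                   options={'finite_diff_derivative': 'directional'})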

Best,
Sara