[SciPy-Dev] directional derivatives in Wolfe line search

Andrew Nelson andyfaff at gmail.com
Wed Jul 28 03:38:29 EDT 2021


On Wed, 28 Jul 2021 at 15:35, Sara Fridovich-Keil <sarafridov at gmail.com>
wrote:

> Thanks for the input. I can describe the experiment I’ve done so far that
> makes me believe this simpler finite differencing would be an improvement
> for the DFO case, but I’m not sure if there is a standard benchmark that is
> used for making these decisions in scipy.
>

The first hurdle is to ensure that correctness isn't affected. The second
hurdle is to show that the changes improve things. I'm not sure that we have a
comprehensive test suite for these kinds of comparisons in scipy. There
are functions in benchmarks/test_functions.py, but we don't have anything
more comprehensive like CUTEst. It would be good to have that!


> My experiment is on the DFO benchmark of Moré and Wild (
> https://www.mcs.anl.gov/~more/dfo/), comparing 3 algorithms. The green
> curve is just calling the current implementation of fmin_bfgs. The orange
> curve is my own implementation of fmin_bfgs based on scipy but using a
> simple forward finite differencing approximation to the full gradient.
>

If I understand that graph correctly, your implementation in orange is
better than the existing scipy implementation?

> The blue curve is the same as the orange one, but replacing the forward
> finite differencing approximation of the full gradient with a forward
> finite differencing approximation of just the directional derivative
> (inside the Wolfe line search). The sampling distance for finite
> differencing is 1.4901161193847656e-08 (for both the orange and blue
> curves), which is the same default as scipy (
> https://github.com/scipy/scipy/blob/v1.7.0/scipy/optimize/optimize.py#L1448,
> and the value recommended in Nocedal and Wright). The blue curve is the
> algorithm I’m proposing; I just included the orange one to compare the
> simple vs sophisticated full-gradient finite differencing methods (and any
> differences in the outer BFGS, which should be basically the same).
>
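Just to be explicit about what I understand the orange curve to be doing,
here is a minimal sketch of a forward-difference approximation of the full
gradient with a fixed absolute step of sqrt(machine epsilon); the helper
name is mine, not actual scipy code:

    import numpy as np

    def forward_diff_gradient(fun, x, h=1.4901161193847656e-08):
        # Forward-difference estimate of the full gradient: one extra
        # function evaluation per coordinate, fixed absolute step h.
        f0 = fun(x)
        grad = np.empty_like(x, dtype=float)
        for i in range(x.size):
            x_step = x.copy()
            x_step[i] += h
            grad[i] = (fun(x_step) - f0) / h
        return grad

So each gradient estimate costs n + 1 function evaluations, versus the two
evaluations per directional derivative in the blue curve.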

Note that the finite differencing for most of the optimizers is done by
approx_derivative (
https://github.com/scipy/scipy/blob/master/scipy/optimize/_numdiff.py#L257),
which chooses the step size automatically. If `jac=None` for
`_minimize_bfgs`, the default is absolute steps, but you can use
relative steps by specifying `jac='2-point'`, etc.
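For example (a minimal sketch using the built-in Rosenbrock test function;
nothing here is specific to your benchmark):

    import numpy as np
    from scipy.optimize import minimize, rosen

    x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

    # jac=None: the gradient is approximated with absolute steps.
    res_abs = minimize(rosen, x0, method='BFGS')

    # jac='2-point': the gradient estimate goes through approx_derivative,
    # which chooses relative steps automatically.
    res_rel = minimize(rosen, x0, method='BFGS', jac='2-point')

    print(res_abs.nfev, res_rel.nfev)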

I'm guessing that the modification would be:

1. If the user provides a callable `jac`, then use that in derphi.
2. If `jac` is estimated via a FD method, then estimate derphi by finite
differencing along the search direction, as you suggest (see the sketch
below). Estimate grad by FD of `fun`. Both grad and derphi are calculated
in line_search_wolfe1 and line_search_wolfe2.
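
A rough sketch of what I have in mind for the derphi estimate in step 2
(hand-written, reusing the same sqrt(machine epsilon) absolute step; the
helper name is mine, not actual scipy code):

    def derphi_forward_fd(fun, xk, pk, h=1.4901161193847656e-08):
        # Returns derphi(alpha) ~= d/dalpha f(xk + alpha*pk), estimated by
        # forward differencing along the search direction pk: two function
        # evaluations per call, independent of the problem dimension.
        def derphi(alpha):
            f0 = fun(xk + alpha * pk)
            f1 = fun(xk + (alpha + h) * pk)
            return (f1 - f0) / h
        return derphi

The idea being that this replaces the existing derphi, which evaluates the
(possibly FD-approximated) full gradient at xk + alpha*pk and dots it with pk.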