Rationalisation of finite differences for minimize
Dear devs,

In order to fully address https://github.com/scipy/scipy/issues/6026 I propose the following behaviour for `optimize.minimize`. For minimize methods that use a jacobian:

1) if a callable jacobian is supplied, simply use that. (no change)

2) if `jac is True`, then `fun` is assumed to return both function and gradient. (no change)

3) if `jac is None` or `bool(jac)` evaluates to `False`, then use forward differences with an absolute step of `epsilon`. This ensures backwards compatibility is kept for those methods that already accept an `epsilon` or `eps` keyword.

4) add a `finite_diff_rel_step` keyword. For further information see `optimize._numdiff.approx_derivative`. This would be changed behaviour for some methods. trust-constr already has this keyword (which is why I suggested it). This keyword would only be added to those methods that already have an `epsilon` or `eps` keyword.

5) if `jac in ['2-point', '3-point', 'cs']`, the finite differences would not use an absolute step size, but would use forward differences, central differences, or complex step (respectively) for the gradient calculation. The step size would be that specified by `finite_diff_rel_step`. This would only work for those methods that possess the `finite_diff_rel_step` keyword. Gradient calculation would be done by `_differentiable_functions.ScalarFunction`, which eventually calls `_numdiff.approx_derivative`. This approach is already used by trust-constr. `ScalarFunction` caches function evaluations.

6) there are some methods that only work if points 1 or 2 are satisfied (i.e. `callable(jac) or jac is True`). These methods are: trust-krylov, trust-ncg, trust-exact, dogleg, Newton-CG (I hope I have them all). They raise an exception if a callable gradient function is not provided by the user. It's unclear to me whether these methods can use a gradient calculated by finite differences. If they can, then I propose to add the ability to use finite differences as in points 1, 2, 4, 5 (not 3) above. I need feedback on this point from subject experts.

7) possibly remove `approx_jacobian` from slsqp.py. This should be doable with `approx_derivative` instead.

8) fix up the documentation for these options.

Enhancements
--------------------
a) the better tested `approx_derivative` is used instead of `approx_fprime`.
b) relative step sizes are much preferred to absolute step sizes, as truncation and round-off errors are balanced.
c) there is (hopefully) a more consistent interface for specifying `jac`.
d) gradients calculated by central differences are more accurate than a 2-point calculation; it's nice to have the option to use that (at the expense of extra function evaluations).
e) it should be possible to reduce the number of function evaluations by a small amount, because `ScalarFunction` caches values.
f) bounds can be supplied to `ScalarFunction`/`approx_derivative`. If a parameter value is at a boundary, neither of those functions will take a step outside the boundary whilst evaluating the gradient.

Drawbacks
---------------
g) Technically this maintains backwards compatibility according to the documentation. However, some users may be using `jac='2-point'`, which at the moment may be converted to an absolute step + forward difference calculation. The documentation doesn't say that an absolute step is being used internally. The proposed behaviour would move it to a relative step, so those users would get a slightly changed implementation. Backwards compatibility is only assured by setting `jac=False` or `jac=None`.
h) minimize is a complex beast with lots of interlocking behaviour. I don't think this will increase complexity by an appreciable amount, but there's definitely a possibility someone is upset because behaviour changes slightly.

I'd appreciate feedback on this, especially on why gradient functions are required (point 6) for some methods. Hopefully it's not too controversial.
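For concreteness, here is what usage would look like under points 1 and 5 (a sketch assuming the proposal is adopted for BFGS; trust-constr already accepts the string forms of `jac`):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

# point 1: a callable jacobian is used directly (no change)
res_callable = minimize(rosen, x0, method='BFGS', jac=rosen_der)

# point 5: a string jac selects a finite-difference scheme with a
# relative step; '3-point' is central differences, 'cs' is complex step
res_fd = minimize(rosen, x0, method='BFGS', jac='3-point')

print(res_callable.x)
print(res_fd.x)
```

Both calls should converge to the same minimum; the second spends extra function evaluations on the central-difference gradient but needs no user-supplied derivative.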
1, 2, and 5-8 look good to me.

> 3) if `jac is None` or `bool(jac)` evaluates to `False`, then use forward
> differences with an absolute step of epsilon. This behaviour ensures back
> compatibility is kept for those methods that already accept an `epsilon` or
> `eps` keyword.
I would suggest that if `eps` is explicitly defined by the user, then that value should be respected; but if it is not explicitly defined and `bool(jac)` is `False`, we might upgrade the derivative approximation to `approx_derivative` with a relative step. If this is deemed a real backwards compatibility issue, never mind.
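That suggested dispatch could be sketched as follows (the helper name `choose_fd_strategy` is hypothetical, purely for illustration; it is not an actual scipy function):

```python
def choose_fd_strategy(jac, eps=None):
    """Hypothetical sketch of the suggested dispatch for `minimize`.

    Returns (method, step_kind, step) describing how the gradient would
    be obtained.
    """
    if callable(jac) or jac is True:
        # points 1 and 2: the user supplies the gradient
        return ('user', None, None)
    if jac in ('2-point', '3-point', 'cs'):
        # point 5: finite differences with a relative step
        return (jac, 'relative', None)
    # here bool(jac) is False, i.e. jac is None/False
    if eps is not None:
        # an explicitly supplied eps is respected as an absolute step
        return ('2-point', 'absolute', eps)
    # eps not given: upgrade to approx_derivative's default relative step
    return ('2-point', 'relative', None)
```

With this sketch, `choose_fd_strategy(None, eps=1e-6)` keeps the legacy absolute step, while `choose_fd_strategy(None)` quietly upgrades to a relative step.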
> 4) have an additional `finite_diff_rel_step` keyword. For further
> information see `optimize._numdiff.approx_derivative`. This would be changed
> behaviour for some methods. trust-constr already has this keyword (which is
> why I suggested it). This keyword would only be added to those that already
> have an `epsilon` or `eps` keyword.
What will happen if the user specifies both? Rather than having a new argument, I might have suggested that we use a provided `eps` as a relative step size if `jac` is one of the allowed strings. Then again, if `finite_diff_rel_step` is already used by `trust-constr`, it makes sense to use it.

> 6) there are some methods that only work if points 1 or 2 are satisfied
> (i.e. `callable(jac) or jac is True`). These methods are: trust-krylov,
> trust-ncg, trust-exact, dogleg, Newton-CG (I hope I have them all). They
> raise an exception if a callable gradient function is not provided by the
> user. It's unclear to me if these methods can use a gradient calculated by
> finite differences. If they can, then I propose to add the ability to use
> finite differences with points 1, 2, 4, 5 (not 3) above. I need feedback on
> this point from subject experts.
If you don't hear back from the experts, I'd say it's worth a shot to add the ability to use finite-difference approximations in these algorithms. Some algorithms are more robust to derivative approximation errors than others, sure, but I've not heard of an algorithm that requires its derivatives to be accurate to the same machine precision as the function evaluation.

> Drawbacks
> ---------------
> g) Technically this meets back compatibility according to documentation.
> However, some users may be using `jac='2-point'`, which at the moment may be
> converted to an absolute step + forward difference calculation. The
> documentation doesn't say that an absolute step is being used internally.
> The proposed behaviour would move it to a relative step. Thus users would
> get a slightly changed implementation. Backwards compatibility is only
> assured by setting `jac=False` or `jac=None`.
(If we think the suggestion for 3 above is OK, then backwards compatibility is assured by setting `jac=False` or `jac=None` AND explicitly defining `eps`)
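The practical difference behind g) is easy to demonstrate: with a badly scaled parameter, the historical fixed absolute step (as in `approx_fprime`) can be disastrous, while `approx_derivative`'s default relative step scales with the parameter. A small comparison (using the private `scipy.optimize._numdiff` module, as named in the proposal):

```python
import numpy as np
from scipy.optimize import approx_fprime
from scipy.optimize._numdiff import approx_derivative

def f(x):
    return x[0] ** 2

x0 = np.array([1e8])   # badly scaled parameter
exact = 2 * x0         # analytic derivative

# absolute step of sqrt(eps) ~ 1.5e-8: comparable to the float spacing
# near 1e8, so the forward difference is dominated by round-off
g_abs = approx_fprime(x0, f, np.sqrt(np.finfo(float).eps))

# default relative step, scaled by |x0|: truncation and round-off balance
g_rel = approx_derivative(f, x0, method='2-point')

print(abs(g_abs - exact) / exact)   # large relative error
print(abs(g_rel - exact) / exact)   # small relative error
```

The absolute-step result is off by tens of percent here, while the relative-step result is accurate to roughly `sqrt(eps)`; this is why g) warns that silently switching users from an absolute to a relative step changes their results, usually for the better.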
Participants (2)
- Andrew Nelson
- Matt Haberland