Hi all, I'm trying to connect a new optimizer (line search via cubic interpolation) to Matthieu's code. Here are some things that should probably be discussed in a wider circle. I have already committed:

./solvers/optimizers/line_search/qubic_interpolation.py
tests/test_qubic_interpolation.py

The problems:

1. I have implemented the stop tolerance on x as self.minStepSize. However, isn't it more correct to observe |x_prev - x_new| against the user-given xtol, rather than |alpha_prev - alpha_new|? If the routine is called from a multi-dimensional NL problem with a known, user-provided xtol, I think it is more convenient and more correct to use |x_prev - x_new| instead of |alpha_prev - alpha_new| as the stop criterion.

2. (This one is primarily for Matthieu.) Where should gradtol be taken from? It's the main stop criterion according to the algorithm. Currently I just set it to 1e-6.

3. Don't you think a maxIter and/or maxFunEvals rule should be added? (I ask because I didn't see them in Matthieu's quadratic_interpolation solver.) It would make the algorithms more robust against CPU hangs caused by our own bugs and/or special functions. I had implemented them, but since Matthieu didn't have them in quadratic_interpolation, I just commented out those stop criteria (all I can do is set my own defaults like 400 or 1000, as well as gradtol = 1e-6, but since Matthieu's "state" variable (afaik) doesn't have them, I can't take them as parameters). So should they exist or not?

Regards, D.
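[Editor's note: to make the three criteria above concrete, here is a minimal sketch of where they could sit in a line-search loop. The function and argument names are made up for illustration, this is not the committed qubic_interpolation.py code, and a plain backtracking update stands in for the cubic interpolation step.]

import numpy

def cubic_line_search(fun, grad, x0, direction, xtol=1e-6, gradtol=1e-6, maxIter=400):
    # Sketch only: a backtracking update replaces the real cubic interpolation,
    # just to show where the three stop criteria from the message above would go.
    x0 = numpy.asarray(x0, dtype=float)
    direction = numpy.asarray(direction, dtype=float)
    alpha = 1.0
    x_prev = x0
    x_new = x0
    for _ in range(maxIter):                                  # 3. maxIter cap
        x_new = x0 + alpha * direction
        if numpy.linalg.norm(x_new - x_prev) < xtol:          # 1. |x_prev - x_new| < xtol
            break
        if abs(numpy.dot(grad(x_new), direction)) < gradtol:  # 2. gradtol on the directional derivative
            break
        alpha *= 0.5   # placeholder update; the real solver would interpolate a cubic here
        x_prev = x_new
    return x_new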
Hi again ;)

I have already committed
./solvers/optimizers/line_search/qubic_interpolation.py
tests/test_qubic_interpolation.py
qubic should be cubic, no?

the problems:
1. I have implemented the stop tolerance on x as self.minStepSize. However, isn't it more correct to observe |x_prev - x_new| against the user-given xtol, rather than |alpha_prev - alpha_new|? If the routine is called from a multi-dimensional NL problem with a known, user-provided xtol, I think it is more convenient and more correct to use |x_prev - x_new| instead of |alpha_prev - alpha_new| as the stop criterion.
The basic cubic interpolation works on alpha. If you want to implement another based on x, no problem. I think that as a first step we should add standard algorithms that are documented and described. After this step is done, we can explore.

2. (this is primarily for Matthieu): where should gradtol be taken from?
It's the main stop criterion according to the algorithm. Currently I just set it to 1e-6.
It should be taken in the constructor (see damped_line_search.py for instance).

3. Don't you think that a maxIter and/or maxFunEvals rule should be added?
(I ask because I didn't see them in Matthieu's quadratic_interpolation solver.)
That is a good question that was also raised in our discussion of the Strong Wolfe Powell rules, at least for maxIter.

It would make the algorithms more robust against CPU hangs caused by our own bugs and/or special functions. I had implemented them, but since Matthieu didn't have them in quadratic_interpolation, I just commented out those stop criteria (all I can do is set my own defaults like 400 or 1000, as well as gradtol = 1e-6, but since Matthieu's "state" variable (afaik) doesn't have them, I can't take them as parameters). So should they exist or not?
If you want to use them, you should put them in the __init__ method as well.

The state could be populated with everything, but that would mean very cumbersome initializations. On one hand, you would create each module with no parameters and pass all of them to the optimizer, which could mean a very long and unreadable line. On the other hand, you would create the optimizer first and then create every module with the optimizer as a parameter, which is not intuitive enough. This is where the limit between the separation principle and object orientation gets fuzzy. So the state dictionary is only responsible for what is specifically connected to the function: either the parameters, or the different evaluations (hessian, gradient, direction and so on). That's why you "can't" put gradtol in it (for instance).

I saw that you test for the presence of the gradient method; you should not. If people want to use this line search, they _must_ provide a gradient. If they can't provide an analytical gradient, they can provide a numerical one by using helpers.ForwardFiniteDifferenceDerivatives. This is questionable, I know, but the simpler the algorithms, the simpler their use, their reading and their debugging (that way, you can get rid of the f_and_df function as well, or at least of the test).

Matthieu
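[Editor's note: a minimal sketch of the split described above, with tolerances and iteration limits owned by the module's __init__ and only function-related quantities in the state dictionary. The class name, the __call__ signature and the state keys are assumptions for illustration, not the actual scikit API.]

class CubicInterpolationSearch(object):
    # Illustrative only: the module owns the parameters it is responsible for.
    def __init__(self, gradtol=1e-6, min_step_size=1e-10, max_iter=400):
        self.gradtol = gradtol
        self.min_step_size = min_step_size
        self.max_iter = max_iter

    def __call__(self, origin, function, state):
        # The state dictionary only carries function-related quantities,
        # e.g. the current search direction or a cached gradient.
        direction = state['direction']
        gradient = state.get('gradient', function.gradient(origin))
        # ... the cubic interpolation itself would go here, using
        # self.gradtol, self.min_step_size and self.max_iter ...
        return origin  # placeholder

# Usage: each instance keeps its own tolerances.
line_search = CubicInterpolationSearch(gradtol=1e-5, max_iter=200)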
Matthieu Brucher wrote:

Hi again ;)
I have already committed
./solvers/optimizers/line_search/qubic_interpolation.py
tests/test_qubic_interpolation.py
qubic should be cubic, no?

Yes, I will rename it now.
the problems: 1. I have implemented the stop tolerance on x as self.minStepSize. However, isn't it more correct to observe |x_prev - x_new| against the user-given xtol, rather than |alpha_prev - alpha_new|? If the routine is called from a multi-dimensional NL problem with a known, user-provided xtol, I think it is more convenient and more correct to use |x_prev - x_new| instead of |alpha_prev - alpha_new| as the stop criterion.
The basic cubic interpolation works on alpha. If you want to implement another based on x, no problem. I think that as a first step we should add standard algorithms that are documented and described. After this step is done, we can explore.

Yes, but all your solvers are already written in terms of the n-dimensional problem (x0 and direction, both of size nVars), so it would be more natural to use xtol (from the general problem), not alpha_tol (from the line-search subproblem).
2. (this is primarily for Matthieu): where should gradtol be taken from? It's the main stop criterion according to the algorithm. Currently I just set it to 1e-6.
It should be taken in the constructor (see damped_line_search.py for instance).
3. Don't you think that a maxIter and/or maxFunEvals rule should be added? (I ask because I didn't see them in Matthieu's quadratic_interpolation solver.)
That is a good question that was also raised in our discussion of the Strong Wolfe Powell rules, at least for maxIter.
As for SWP, I think a check should be made for the case where a solution with the required c1 and c2 can't be obtained and/or doesn't exist at all. For example, objFun(x) = 1e-5*x with c1 = 1e-4 (IIRC this is the example where I encountered alpha = 1e-28 -> f0 = f(x0 + alpha*direction)). The SWP-based solver should produce something other than a CPU hang. OK, it turned out to be impossible to obtain a new_X that satisfies c1 and c2, but an approximation will very often be good enough to continue solving the enclosing NL problem. So I think a check for |x_prev - x_new| < xtol should be added; it would be very helpful here. You have something like that with alphap in lines 68-70 (swp.py), but this is very unclear and I suspect it may be endless for some problems (as may the other stop criteria currently implemented in SWP).
It would make the algorithms more robust against CPU hangs caused by our own bugs and/or special functions. I had implemented them, but since Matthieu didn't have them in quadratic_interpolation, I just commented out those stop criteria (all I can do is set my own defaults like 400 or 1000, as well as gradtol = 1e-6, but since Matthieu's "state" variable (afaik) doesn't have them, I can't take them as parameters). So should they exist or not?
If you want to use them, you should put them in the __init__ method as well.
The state could be populated with everything, but that would mean very cumbersome initializations. On one hand, you would create each module with no parameters and pass all of them to the optimizer, which could mean a very long and unreadable line. On the other hand, you would create the optimizer first and then create every module with the optimizer as a parameter, which is not intuitive enough. This is where the limit between the separation principle and object orientation gets fuzzy. So the state dictionary is only responsible for what is specifically connected to the function: either the parameters, or the different evaluations (hessian, gradient, direction and so on). That's why you "can't" put gradtol in it (for instance).
I don't know your code very well yet, but why can't you just set default params as I do in /Kernel/BaseProblem.py? Then if the user wants to change any specific parameter, he can do it very easily, and no "very long and not readable line" is present in my code.
I saw that you test for the presence of the gradient method; you should not. If people want to use this line search, they _must_ provide a gradient. If they can't provide an analytical gradient, they can provide a numerical one by using helpers.ForwardFiniteDifferenceDerivatives. This is questionable, I know, but the simpler the algorithms, the simpler their use, their reading and their debugging (that way, you can get rid of the f_and_df function as well, or at least of the test).
I still think that approach is incorrect; the user should not have to supply a gradient, we should calculate it ourselves if it's absent. At least, every optimization software I know of does the trick. As for helpers.ForwardFiniteDifferenceDerivatives, it will take too much time for the user to dig into the documentation to find it.

Also, as you can see, my f_and_df is optimized to not recalculate f(x0) while obtaining the gradient numerically, unlike some implementations, for example approx_fprime in scipy.optimize. For problems with costly functions and small nVars (1..5) the speedup can be significant. Of course, it should be placed in a single file for the whole "optimizers" package, like I do in my ObjFunRelated.py, not in qubic_interpolation.py. But it would be better if you chose the most appropriate place (and/or informed me where it is).

Regards, D.
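[Editor's note: a minimal sketch of the idea behind f_and_df as described above: the forward-difference gradient reuses the already-computed f(x0) instead of evaluating it one extra time. The names and the default step are illustrative, not the committed code.]

import numpy

def f_and_df(fun, x, diff_int=1.5e-8):
    # Return f(x) together with a forward-difference gradient that reuses f(x),
    # so the value-plus-gradient pair costs n + 1 evaluations instead of n + 2.
    x = numpy.asarray(x, dtype=float)
    f0 = fun(x)                       # computed once, reused for every coordinate
    grad = numpy.empty_like(x)
    for i in range(x.size):
        x_step = x.copy()
        x_step[i] += diff_int
        grad[i] = (fun(x_step) - f0) / diff_int
    return f0, grad

# Example: f0, g0 = f_and_df(lambda x: (x ** 2).sum(), numpy.array([1.0, 2.0]))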
The basic cubic interpolation works on alpha. If you want to implement another based on x, no problem. I think that as a first step we should add standard algorithms that are documented and described. After this step is done, we can explore.

Yes, but all your solvers are already written in terms of the n-dimensional problem (x0 and direction, both of size nVars), so it would be more natural to use xtol (from the general problem), not alpha_tol (from the line-search subproblem).
xtol is not available everywhere; it's criterion-dependent and might be given or not. The documented algorithm is based on alpha, and our main implementation should follow the documented algorithm as well.
That is a good question that was also raised in our discussion of the Strong Wolfe Powell rules, at least for maxIter.

As for SWP, I think a check should be made for the case where a solution with the required c1 and c2 can't be obtained and/or doesn't exist at all. For example, objFun(x) = 1e-5*x with c1 = 1e-4 (IIRC this is the example where I encountered alpha = 1e-28 -> f0 = f(x0 + alpha*direction)). The SWP-based solver should produce something other than a CPU hang. OK, it turned out to be impossible to obtain a new_X that satisfies c1 and c2, but an approximation will very often be good enough to continue solving the enclosing NL problem.
According to the theory, a solution is found if the direction is correct and if the function is continuously differentiable. But I agree that some additional tests must be added.

So I think a check for |x_prev - x_new| < xtol should be added; it would be very helpful here. You have something like that with alphap in lines 68-70 (swp.py), but this is very unclear and I suspect it may be endless for some problems (as may the other stop criteria currently implemented in SWP).
This check is done because of the interpolation nature of the WP algorithm. You can't put |x_prev - x_new| < xtol as a stopping criterion. This is why your idea of limiting the number of iterations is relevant.
So the state dictionary is only responsible for what is specifically connected to the function: either the parameters, or the different evaluations (hessian, gradient, direction and so on). That's why you "can't" put gradtol in it (for instance).

I don't know your code very well yet, but why can't you just set default params as I do in /Kernel/BaseProblem.py?
Because there are a lot of default parameters that could be set depending on the algorithm. From an object-oriented point of view, this way of doing things is correct: the different modules possess the arguments because they are responsible for using them. Besides, you may want a different gradient tolerance for different sub-modules.

Then if the user wants to change any specific parameter, he can do it very easily, and no "very long and not readable line" is present in my code.
It won't be in your code, it would be in the user code. The goal of the framework is to make things readable and usable. If every parameter is only given to the optimizer, the other submodules are no longer of interest. With the parameters given to each specific module according to where they belong, one can create several instances of a specific module to use with different optimizers. This is what object-oriented code is intended for. I've done this in my PhD code, and it works very well.

I still think that approach is incorrect; the user should not have to supply a gradient, we should calculate it ourselves if it's absent. At least, every optimization software I know of does the trick.
Perhaps, but I have several reasons:
- When it's hidden, it's a magic trick. My view of the framework is that it must not do anything like that. It's designed for advanced users that do not want those tricks.
- From an architectural point of view, it's wrong, plainly wrong. I'm a scientist specialized in electronics and signal processing, but I have high requirements for everything that is IT-oriented. Encapsulation is one part of the object principle, and implementing finite differences outside breaks it (and I'm not even talking about code duplication).

As for helpers.ForwardFiniteDifferenceDerivatives, it will take too much time for the user to dig into the documentation to find it.
Not if it's clearly stated in the documentation; perhaps the documentation can be enhanced. It's on the first optimization page in the scikit wiki, with examples.

Also, as you can see, my f_and_df is optimized to not recalculate f(x0) while obtaining the gradient numerically, unlike some implementations, for example approx_fprime in scipy.optimize. For problems with costly functions and small nVars (1..5) the speedup can be significant.
Yes, I agree. For optimization, an additional argument could be given to the gradient, to be used if needed (remember that the other way of implementing finite differences does not use f(x0)), but it would bring some trouble to the user (every gradient function would have to accept this additional argument). This should be discussed in a specific topic, because I don't think there is a simple answer to this specific point (and, as I stated before, I prefer something complete but not as fast over something incomplete but very fast; there is no such thing as a free lunch in IT, we have to make compromises).

Of course, it should be placed in a single file for the whole "optimizers" package, like I do in my ObjFunRelated.py, not in qubic_interpolation.py. But it would be better if you chose the most appropriate place (and/or informed me where it is).
It's already implemented (both versions) and available.

Matthieu
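[Editor's note: to make the gradient requirement discussed above concrete, here is a minimal sketch of a user-defined cost object providing both the value and the gradient, which is what the line-search modules are allowed to assume. The class is illustrative; the analytical gradient could equally be supplied by the finite-difference helper mentioned above.]

import numpy

class Rosenbrock(object):
    # Illustrative user-defined cost with both __call__ and gradient.
    def __call__(self, x):
        return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

    def gradient(self, x):
        return numpy.array([
            -400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
            200.0 * (x[1] - x[0] ** 2),
        ])

# If only __call__ were available, the gradient method would be provided by a
# finite-difference wrapper instead of being written by hand.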
Matthieu Brucher wrote:
> So the state dictionary is only responsible for what is specifically
> connected to the function: either the parameters, or the different
> evaluations (hessian, gradient, direction and so on). That's why you
> "can't" put gradtol in it (for instance).

I don't know your code very well yet, but why can't you just set default params as I do in /Kernel/BaseProblem.py?
Because there are a lot of default parameters that could be set depending on the algorithm. From an object-oriented point of view, this way of doing things is correct: the different modules possess the arguments because they are responsible for using them. Besides, you may want a different gradient tolerance for different sub-modules.
I do have different gradtol values for different problem classes, for example NLP and NSP (non-smooth). Within any problem class the default gradtol value should be a constant known to the user, like TOMLAB does. Setting a different gradtol for each solver is senseless; it's a matter of how close xk is to x_opt (of course, if the function is very special and/or non-convex and/or non-smooth, the default value may be worth changing, but the user must decide that according to his knowledge of the function).

The situation with xtol or funtol or diffInt is more complex, but TOMLAB still has default constants common to all algorithms, almost in the same way as I do, and years of successful TOMLAB adoption (see http://tomopt.com/tomlab/company/customers.php) are one more piece of evidence that this approach is good.

As for special solver params, like the space transformation parameter in ralg, they can be passed from a dictionary or something like that. For example, in openopt for MATLAB I do

p.ralg.alpha = 2.5
p.ralg.n2 = 1.2

In Python it's not implemented yet, but I intend to do something like

p = NSP(...)
p.ralg = {'alpha' : 2.5, 'n2': 1.2}
# the other 3 ralg params remain unchanged - they are loaded from default settings
r = p.solve('ralg')

or

p.ralg = p.setdefaults('ralg')
print p.ralg.alpha # the previous approach didn't allow this
...
r = p.solve('ralg')

BTW, TOMLAB handles obtaining the numerical gradient the same way I do. I think 90% of users don't care at all which tolfun, tolx etc. are set or how the gradient is calculated; maybe lots of them don't even know that a gradient is being obtained. They just require the problem to be solved, no matter how, with or without a gradient (as you see, the only scipy optimizer that handles nonlinear constraints is cobyla, and it takes no user gradient). Of course, if the problem is too complex to be solved quickly, they can begin investigating ways to speed it up, and the gradient will probably be the first one.
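[Editor's note: a minimal sketch of the per-solver settings idea above. The names mirror the ralg example and the default values are placeholders, not an existing API: user-supplied values are merged over the defaults, so unspecified parameters keep their default values.]

# Default ralg parameters; the values here are placeholders for illustration.
RALG_DEFAULTS = {'alpha': 2.0, 'n2': 1.0, 'h0': 1.0, 'q1': 0.9, 'q2': 1.1}

class NSP(object):
    def __init__(self, **ralg_overrides):
        self.ralg = dict(RALG_DEFAULTS)   # start from the common defaults
        self.ralg.update(ralg_overrides)  # user-specified values win

p = NSP(alpha=2.5, n2=1.2)
print(p.ralg['alpha'])  # 2.5 (overridden)
print(p.ralg['h0'])     # 1.0 (default kept, as for the other untouched params)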
I still think that approach is incorrect; the user should not have to supply a gradient, we should calculate it ourselves if it's absent. At least, every optimization software I know of does the trick.
Perhaps, but I have several reasons:
- When it's hidden, it's a magic trick. My view of the framework is that it must not do anything like that. It's designed for advanced users that do not want those tricks.
- From an architectural point of view, it's wrong, plainly wrong. I'm a scientist specialized in electronics and signal processing, but I have high requirements for everything that is IT-oriented. Encapsulation is one part of the object principle, and implementing finite differences outside breaks it (and I'm not even talking about code duplication).
So, as I said, it could be solved with something like p.showDefaults('ralg') or p = NLP(...); print p.xtol. For 99.9% of users that should be enough.
Also, as you can see, my f_and_df is optimized to not recalculate f(x0) while obtaining the gradient numerically, unlike some implementations, for example approx_fprime in scipy.optimize. For problems with costly functions and small nVars (1..5) the speedup can be significant.
Yes, I agree. For optimization, an additional argument could be given to the gradient, to be used if needed (remember that the other way of implementing finite differences does not use f(x0)), but it would bring some trouble to the user (every gradient function would have to accept this additional argument).
I don't see any trouble for the user: he just provides either f and df, or only f, as anywhere else, without any changes. Or did you mean the gradient being func(x, arg1, arg2, ...)? There is no trouble there either, for example with a redefinition df = lambda x: func(x, arg1, arg2) from the very beginning.

As for my openopt, there are lots of tools to prevent recalculating things twice. One of them is to check against the previous x: if it's the same, return the previous fval (cval, hval). For example, if a solver (or the user from his df) calls

F = p.f(x)
DF = p.df(x)

and df is calculated numerically, it will not recalculate f(x = x0); it will just use the previous value from calculating p.f (because x is equal to p.FprevX). The same goes for dc, dh (c(x) <= 0, h(x) = 0). As you know, the comparison numpy.all(x == xprev) doesn't take too much time; at least it took much less than 0.1% of the total time/cputime elapsed when I observed MATLAB profiler results on different NL problems.

Regards, D.
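[Editor's note: a minimal sketch of the caching described above. The attribute name FprevX mirrors the message; everything else is illustrative. The previous point and value are remembered, so p.df(x) right after p.f(x) does not re-evaluate the objective at the same x.]

import numpy

class CachedProblem(object):
    def __init__(self, fun, diff_int=1.5e-8):
        self.fun = fun
        self.diff_int = diff_int
        self.FprevX = None   # last point at which fun was evaluated
        self.Fprev = None    # value of fun at FprevX

    def f(self, x):
        x = numpy.asarray(x, dtype=float)
        if self.FprevX is None or not numpy.all(x == self.FprevX):
            self.FprevX = x.copy()
            self.Fprev = self.fun(x)   # only recomputed when x really changed
        return self.Fprev

    def df(self, x):
        x = numpy.asarray(x, dtype=float)
        f0 = self.f(x)                 # cache hit if p.f(x) was just called
        grad = numpy.empty_like(x)
        for i in range(x.size):
            x_step = x.copy()
            x_step[i] += self.diff_int
            grad[i] = (self.fun(x_step) - f0) / self.diff_int
        return grad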
I do have different gradtol values for different problem classes, for example NLP and NSP (non-smooth). Within any problem class the default gradtol value should be a constant known to the user, like TOMLAB does. Setting a different gradtol for each solver is senseless; it's a matter of how close xk is to x_opt (of course, if the function is very special and/or non-convex and/or non-smooth, the default value may be worth changing, but the user must decide that according to his knowledge of the function).
There is a difference between a gradient tolerance for the global stop criterion, which works on the real gradient, and a gradient tolerance in the line search, which works on a scalar. We can have the same default value for all gradient tolerances, but the framework is not designed to unify them. That is not its purpose.

The situation with xtol or funtol or diffInt is more complex, but TOMLAB still has default constants common to all algorithms, almost in the same way as I do, and years of successful TOMLAB adoption (see http://tomopt.com/tomlab/company/customers.php) are one more piece of evidence that this approach is good.
xtol or funtol are used in the criteria, and that is only one file. Not much trouble. And diffInt is a sort of universal value (the square root of the epsilon for a 32-bit float). Once more, it's defined in one place.

BTW, TOMLAB handles obtaining the numerical gradient the same way I do. I think 90% of users don't care at all which tolfun, tolx etc. are set or how the gradient is calculated; maybe lots of them don't even know that a gradient is being obtained. They just require the problem to be solved, no matter how, with or without a gradient (as you see, the only scipy optimizer that handles nonlinear constraints is cobyla, and it takes no user gradient).
These people are not the prime target of the framework. They want a fully fledged function that takes care of everything; it's exactly what you did with lincher, for instance. On the contrary, here, users must know a little bit about optimization because they have to build the optimizer they want. I don't expect 50% of scipy.optimize users to use this framework. As they must know what optimization is, I can assume that they know about finite-difference methods, tolerances on the parameters, and so on.

So, as I said, it could be solved with something like p.showDefaults('ralg') or p = NLP(...); print p.xtol. For 99.9% of users that should be enough.
Yes, I think we could add a dictionary "à la matplotlib" or something like it. I don't think it is a priority; it can be added seamlessly once the optimizers are stabilized.

I don't see any trouble for the user: he just provides either f and df, or only f, as anywhere else, without any changes. Or did you mean the gradient being func(x, arg1, arg2, ...)? There is no trouble there either, for example with a redefinition df = lambda x: func(x, arg1, arg2) from the very beginning.
If the gradient can accept another formal or positional argument, every gradient in a user-defined function class must have it. As it is done now, that's not acceptable (we cannot require additional arguments, because the gradient might be computed with forward differences).

As for my openopt, there are lots of tools to prevent recalculating things twice. One of them is to check against the previous x: if it's the same, return the previous fval (cval, hval). For example, if a solver (or the user from his df) calls

F = p.f(x)
DF = p.df(x)

and df is calculated numerically, it will not recalculate f(x = x0); it will just use the previous value from calculating p.f (because x is equal to p.FprevX). The same goes for dc, dh (c(x) <= 0, h(x) = 0). As you know, the comparison numpy.all(x == xprev) doesn't take too much time; at least it took much less than 0.1% of the total time/cputime elapsed when I observed MATLAB profiler results on different NL problems.
The last computed x can be added to the state dictionary; do it if that pleases you. But as I stated before, I prefer something tested and complete but not optimized over something not tested or not complete but optimized. BTW, I fixed your tests for the cubic interpolation.

Matthieu