Folks,

The usual confessions: I am relatively new to Python programming and SciPy. I have a problem I'm looking for some help in solving.

I have a list of data pairs [[x1,y1], [x2,y2], ..., [xn,yn]] and am trying to find the best fit of that data to the equation:

    y = a*(1+b*c*y)^(-1/b)

The parameters b and c are constrained:

1) 0 < b <= 5
2) -1 <= c <= 1

Parameter a is only weakly constrained in that a is usually >= max(x1, x2, ..., xn).

My "best fit" goal is to minimize the root mean square deviation (or, equivalently, to maximize the r-square value).

Any suggestions? Thanks.

Brandon C. Nuttall
BNUTTALL@UKY.EDU
Kentucky Geological Survey, University of Kentucky
(859) 257-5500 / (859) 257-1147 (fax)
228 Mining & Mineral Resources Bldg
Lexington, Kentucky 40506-0107
http://www.uky.edu/KGS/home.htm
Brandon Nuttall wrote:
Folks,
The usual confessions: I am relatively new to Python programming and SciPy. I have a problem I'm looking for some help in solving.
I have a list of data pairs [[x1,y1], [x2,y2], ..., [xn,yn]] and am trying to find the best fit of that data to an equation:
y = a*(1+b*c*y)^(-1/b)
Presumably, you mean y = a*(1 + b*c*x) ** (-1.0/b) to correct a typo and use Python notation.
The parameters b and c are constrained:

1) 0 < b <= 5
2) -1 <= c <= 1

Parameter a is only weakly constrained in that a is usually >= max(x1, x2, ..., xn).

My "best fit" goal is to minimize the root mean square deviation (or, equivalently, to maximize the r-square value).
There are a number of constrained optimizers in scipy.optimize. scipy.optimize.fmin_tnc seems most appropriate for the simple bounds you have. In order to get a function to minimize that depends on data, I usually like to use a class:

import numpy as np
from scipy.optimize import fmin_tnc

class LossFunction(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __call__(self, abc):
        """ A function suitable for passing to the fmin() minimizers.
        """
        a, b, c = abc
        y = a*(1.0 + b*c*self.x) ** (-1.0/b)
        dy = self.y - y
        return dy*dy

x = np.array([...])
y = np.array([...])
lf = LossFunction(x, y)
abc0 = np.array([x.max(), 2.5, 0.0])   # or whatever
retcode, nfeval, abc_optimal = fmin_tnc(lf, abc0,
    bounds=[(None, None), (0., 5.), (-1., 1.)])

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
  -- Umberto Eco
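A caveat worth noting here: fmin_tnc minimizes a scalar objective, so a __call__ that returns the array dy*dy will not work as-is; summing the squared residuals does. A minimal sketch under that assumption, written against the present-day scipy.optimize.minimize bound-constrained interface; the L-BFGS-B method, the linspace grid, and the synthetic parameter values are illustrative choices, not from the thread:

```python
import numpy as np
from scipy.optimize import minimize

def loss(abc, x, y):
    # Scalar sum-of-squared-residuals; bound-constrained minimizers
    # expect a single float, not an array of per-point errors.
    a, b, c = abc
    resid = y - a * (1.0 + b * c * x) ** (-1.0 / b)
    return (resid * resid).sum()

# Synthetic, noiseless data with known generating parameters (15.0, 2.5, 0.3).
x = np.linspace(1.0, 10.0, 151)
y = 15.0 * (1.0 + 2.5 * 0.3 * x) ** (-1.0 / 2.5)

# Same box constraints as in the thread: a unconstrained, 0 < b <= 5, -1 <= c <= 1.
res = minimize(loss, x0=[14.0, 2.0, 0.4], args=(x, y),
               method='L-BFGS-B',
               bounds=[(None, None), (1e-3, 5.0), (-1.0, 1.0)])
```

With a starting point in the interior of the bounds, res.x should land close to the generating parameters and res.fun close to zero.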
Hi,

Sorry about the long post. I did not ask the original question, but I tried the solution posted and could not get it to work.

Original question: given a set of data [[x1, y1], ..., [xn, yn]], how to fit

    y = a*(1 + b*c*x) ** (-1.0/b)

to it, subject to the constraints:

    0 < b <= 5
    -1 <= c <= 1
    a ~ >= max(x1, ..., xn)

On 27/02/07: 14:25, Robert Kern wrote:
There are a number of constrained optimizers in scipy.optimize . scipy.optimize.fmin_tnc seems most appropriate for simple bounds like you have. In order to get a function to minimize that depends on data, I usually like to use a class:
import numpy as np
from scipy.optimize import fmin_tnc

class LossFunction(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __call__(self, abc):
        """ A function suitable for passing to the fmin() minimizers.
        """
        a, b, c = abc
        y = a*(1.0 + b*c*self.x) ** (-1.0/b)
        dy = self.y - y
        return dy*dy

x = np.array([...])
y = np.array([...])
lf = LossFunction(x, y)
abc0 = np.array([x.max(), 2.5, 0.0])   # or whatever
retcode, nfeval, abc_optimal = fmin_tnc(lf, abc0,
    bounds=[(None, None), (0., 5.), (-1., 1.)])
I used optimize.leastsq to do the fitting, and it worked well for me (see http://www.scipy.org/Cookbook/FittingData). But I was trying the above method, and unfortunately it doesn't work as presented. I define the variables x and y by:

from scipy.optimize import fmin_tnc
from scipy import rand
from numpy import mgrid, log

def f(abc, x):
    return abc[0]*(1.0+abc[1]*abc[2]*x)**(-1.0/abc[1])

a = 15.0
b = 2.5
c = 0.3
abc = (a, b, c)

num_points = 151
x = mgrid[1:10:num_points*1j]
y = f(abc, x) + rand(num_points) - 0.5

abc0 = [x.max(), 2.5, 0.0]
lf = LossFunction(x, y)
retcode, nfeval, abc_optimal = fmin_tnc(lf, abc0,
    bounds=[(None, None), (0., 5.), (-1., 1.)])

The call to fmin_tnc gives me an error:

  File "/usr/lib/python2.4/site-packages/scipy/optimize/tnc.py", line 191, in func_and_grad
    f, g = func(x, *args)
ValueError: too many values to unpack

Looking at the documentation and the source file, the first parameter to fmin_tnc should be a function that returns the function value and the gradient vector (with respect to the parameters being estimated). Also, it seems that I can tell fmin_tnc to estimate the gradient itself if I specify approx_grad as True:

retcode, nfeval, abc_optimal = fmin_tnc(lf, abc0,
    bounds=[(None, None), (0., 5.), (-1., 1.)], approx_grad=True)

But this gives me an error saying:

  File "/usr/lib/python2.4/site-packages/scipy/optimize/optimize.py", line 576, in approx_fprime
    grad[k] = (apply(f,(xk+ei,)+args) - f0)/epsilon
ValueError: setting an array element with a sequence.
Then I tried redefining __call__ in LossFunction to return the gradient as well:

class LossFunction(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __call__(self, abc):
        a, b, c = abc
        y = a*(1.0 + b*c*self.x) ** (-1.0/b)
        dy = self.y - y
        d = [0, 0, 0]
        d[0] = (1.0 + b*c*self.x) ** (-1.0/b)
        d[1] = y * (log(a*(1.0+b*c*self.x))/b**2 - c*self.x/(b*(1.0+b*c*self.x)))
        d[2] = -a/b*(1.0 + b*c*self.x) ** (-1.0/b) * b*self.x
        return dy*dy, d

When I call this function instead, I get another error:

  File "/usr/lib/python2.4/site-packages/scipy/optimize/tnc.py", line 221, in fmin_tnc
    fmin, ftol, rescale)
ValueError: Bad return value from minimized function.

At this point I gave up, but I would like to understand what I am doing wrong. To me, it seems like fmin_tnc will find the values of the parameters (a, b, c in this case) that minimize a given function that depends *only* on a, b, c. I don't understand how to make it work so as to fit existing data. I am sure I am missing something, but I don't know what that is.

Thanks for all your help,
Alok

--
Alok Singhal
Graduate Student, dept. of Astronomy, University of Virginia
http://www.astro.virginia.edu/~as8ca/
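For reference, the gradient fmin_tnc wants (when approx_grad is not used) is that of the *scalar* loss with respect to the parameters (a, b, c), not the derivative of the model in x, and both return values must be scalars/arrays of the right shape. A sketch of such a loss with its analytic gradient, verified against central finite differences; the data and parameter values below are illustrative:

```python
import numpy as np

def loss_and_grad(abc, x, y):
    # Scalar sum-of-squares loss and its gradient wrt the parameters a, b, c.
    a, b, c = abc
    u = 1.0 + b * c * x
    m = a * u ** (-1.0 / b)                      # model values
    r = y - m                                    # residuals
    dm_da = u ** (-1.0 / b)
    dm_db = m * (np.log(u) / b**2 - c * x / (b * u))
    dm_dc = -a * x * u ** (-1.0 / b - 1.0)
    # dL/dp = -2 * sum(r * dm/dp) for L = sum(r**2)
    grad = np.array([-2.0 * (r * d).sum() for d in (dm_da, dm_db, dm_dc)])
    return (r * r).sum(), grad

# Central finite-difference check of the analytic gradient.
x = np.linspace(1.0, 10.0, 50)
y = 15.0 * (1.0 + 2.5 * 0.3 * x) ** (-1.0 / 2.5) + 0.1
p = np.array([14.0, 2.0, 0.4])
f0, g = loss_and_grad(p, x, y)
eps = 1e-6
for i in range(3):
    dp = np.zeros(3)
    dp[i] = eps
    fd = (loss_and_grad(p + dp, x, y)[0] - loss_and_grad(p - dp, x, y)[0]) / (2 * eps)
    assert abs(fd - g[i]) < 1e-4 * max(1.0, abs(g[i]))
```

A function of this shape (returning the scalar f and the length-3 gradient) is what the first argument of fmin_tnc is expected to be.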
Robert, Thanks. At 03:25 PM 2/27/2007, you wrote:
Presumably, you mean
y = a*(1 + b*c*x) ** (-1.0/b)
to correct a typo and use Python notation.
Yes, you are correct.
import numpy as np
from scipy.optimize import fmin_tnc

class LossFunction(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __call__(self, abc):
        """ A function suitable for passing to the fmin() minimizers.
        """
        a, b, c = abc
        y = a*(1.0 + b*c*self.x) ** (-1.0/b)
        dy = self.y - y
        return dy*dy

x = np.array([...])
y = np.array([...])
lf = LossFunction(x, y)
abc0 = np.array([x.max(), 2.5, 0.0])   # or whatever
retcode, nfeval, abc_optimal = fmin_tnc(lf, abc0,
    bounds=[(None, None), (0., 5.), (-1., 1.)])
OK. I'm on my way home to try it out.

Brandon C. Nuttall
Folks,

I've gotten further, but still am not there. I have the following code:

import numpy as np
from scipy.optimize import fmin_tnc

class hyp_func:
    def __init__(self, *args):
        self.x, self.y = args
        self.avg = y.mean()

    def rmsd(self, *args):
        """A function suitable for passing to the fmin() minimizers
        """
        a, b, c = args[0]
        sse = 0.0
        sst = 0.0
        # minimize the RMSD
        for i in range(len(x)):
            y = a*(1.0+b*c*x[i])**(-1.0/b)
            sse += (self.y[i]-y)**2
            sst += (self.avg-y)**2
        # this won't work because I don't know what x should be or if it is related
        diff = [0.0, 0.0]   # this is just an initial value and won't get any results
        # diff[0] = (-a*c-a*c*c*b*?????)**((-1-b)/b)
        # diff[1] = ((-1-b)/b)*(-a*c*c*b)*(-a*c-a*c*c*b*?????)**(((-1-b)/b)-1)
        return (sse / sst, diff)

# fake data for testing
a, b, c = [1.25, 0.75, 0.25]
x = []
y = []
for i in range(1, 61):
    x.append(float(i))
    y.append(a*(1.0+b*c*float(i))**(-1.0/b))
x = np.array(x)
y = np.array(y)

myfunc = hyp_func(x, y)
params0 = [1.0, 0.5, 0.5]
retcode, nfeval, optimal = fmin_tnc(myfunc.rmsd, params0,
    bounds=[(None, None), (0.001, 5.0), (-1.0, 1.0)])

myfunc.rmsd() now returns the root mean square deviation which is to be minimized. However, in looking at the example code in scipy/optimize/tnc.py, I find that myfunc.rmsd() needs to return a second argument, g, the gradient of the function. It looks like this needs to be the first and second derivative of my function, which, if my [really, really] rusty calculus serves, should be:

    y' = (-a*c-a*c*c*b*x)**((-1-b)/b)
    y'' = ((-1-b)/b)*(-a*c*c*b)*(-a*c-a*c*c*b*x)**(((-1-b)/b)-1)

I'm not sure these derivatives are particularly informative about which way to go to minimize the RMSD. At the moment, the code fails with a SystemError (error return without exception set) in the call to moduleTNC.minimize(). So, any suggestions on my next step?

FWIW: In tnc.py, in test(), the call to fmin_tnc() has a typo in the keyword arguments. "maxnfeval=" should be "maxfun=".
When running tnc.py to execute the tests, example() and test1fg() run. The test2fg() detects an infeasible condition (apparently, I guess that is what was to be tested); Python raises an exception and terminates without running the other tests.

Brandon C. Nuttall
On 28/02/07: 10:56, Brandon Nuttall wrote:
myfunc.rmsd() now returns the root mean square deviation which is to be minimized. However, in looking at the example code in scipy/optimize/tnc.py, I find that myfunc.rmsd() needs to return a second argument, g, the gradient of the function. It looks like this needs to be the first and second derivative of my function which if my [really, really] rusty calculus should be:
    y' = (-a*c-a*c*c*b*x)**((-1-b)/b)
    y'' = ((-1-b)/b)*(-a*c*c*b)*(-a*c-a*c*c*b*x)**(((-1-b)/b)-1)
I think the second return value should be a list of (first, partial) derivatives with respect to the parameters being estimated, i.e., a, b, c in your case. The example in optimize/tnc.py defines the function as pow(x[0],2.0)+pow(abs(x[1]),3.0), and the derivatives g[0] and g[1] as 2.0*x[0] and sign(x[1])*3.0*pow(abs(x[1]),2.0), which are the derivatives of f with respect to the parameters x[0] and x[1]. I calculated those derivatives for your function (and I think I did it right :-) ); you can see them in an earlier message of mine in this thread.
I'm not sure these derivatives are particularly informative about which way to go to minimize the RMSD.
Yeah, that is why it makes more sense to calculate the derivatives with respect to the parameters being estimated.

Here is another way to do the fitting (adapted from the scipy cookbook):

import numpy as np
from scipy import rand
from scipy.optimize import leastsq
import pylab

n = 151
x = np.mgrid[1:10:n*1j]

def f(p, x):
    return p[0]*(1.0+p[1]*p[2]*x)**(-1.0/p[1])

def errf(p, x, y):
    return f(p, x) - y

abc = [15.0, 2.5, 0.3]
y = f(abc, x) + rand(n) - 0.5

abc_guess = [30.0, 4.0, 1.0]
abc1, success = leastsq(errf, abc_guess[:], args=(x, y))
print success   # should be 1
print abc1      # I get [ 14.71976079  2.4536336   0.27560377]

pylab.plot(x, y, x, f(abc1, x))
pylab.show()

-Alok
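One thing leastsq does not do is enforce the bound constraints from the original question. In later SciPy releases, scipy.optimize.curve_fit accepts a bounds argument that covers this case directly; a sketch along those lines, where the random seed and starting guess are arbitrary choices:

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b, c):
    # Hyperbolic model; curve_fit wants x first, then the free parameters.
    return a * (1.0 + b * c * x) ** (-1.0 / b)

x = np.linspace(1.0, 10.0, 151)
rng = np.random.default_rng(0)
y = f(x, 15.0, 2.5, 0.3) + rng.uniform(-0.5, 0.5, x.size)

# Box constraints: a >= 0, 0 < b <= 5, -1 <= c <= 1.
popt, pcov = curve_fit(f, x, y, p0=[x.max(), 2.5, 0.25],
                       bounds=([0.0, 1e-3, -1.0], [np.inf, 5.0, 1.0]))
```

The optimizer then stays inside the box while minimizing the sum of squared residuals, and pcov gives an estimate of the parameter covariance.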
Brandon Nuttall wrote:
Folks,
I've gotten further, but still am not there. I have the following code:
import numpy as np from scipy.optimize import fmin_tnc
class hyp_func:
    def __init__(self, *args):
        self.x, self.y = args
        self.avg = y.mean()

    def rmsd(self, *args):
        """A function suitable for passing to the fmin() minimizers
        """
        a, b, c = args[0]
        sse = 0.0
        sst = 0.0
        # minimize the RMSD
        for i in range(len(x)):
            y = a*(1.0+b*c*x[i])**(-1.0/b)
            sse += (self.y[i]-y)**2
            sst += (self.avg-y)**2
Don't do this looping. Instead, just make sure that self.x and self.y are arrays and use array math:

y = a*(1.0 + b*c*self.x) ** (-1.0/b)
dy = self.y - y
sse = (dy*dy).sum()
sst = ((y - avg)**2).sum()

--
Robert Kern
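A quick sanity check that the array form computes the same numbers as the loop it replaces (the parameter values are the fake-data values used earlier in the thread; the +0.01 offset just stands in for observed data):

```python
import numpy as np

a, b, c = 1.25, 0.75, 0.25
x = np.arange(1.0, 61.0)
yobs = a * (1.0 + b * c * x) ** (-1.0 / b) + 0.01  # pretend observations
avg = yobs.mean()

# Array form: one vectorized expression per quantity.
y = a * (1.0 + b * c * x) ** (-1.0 / b)
dy = yobs - y
sse = (dy * dy).sum()
sst = ((y - avg) ** 2).sum()

# Loop form, element by element (what the original rmsd() did).
sse_loop = 0.0
sst_loop = 0.0
for i in range(len(x)):
    yi = a * (1.0 + b * c * x[i]) ** (-1.0 / b)
    sse_loop += (yobs[i] - yi) ** 2
    sst_loop += (avg - yi) ** 2

assert np.isclose(sse, sse_loop) and np.isclose(sst, sst_loop)
```

Besides being cleaner, the array form avoids a Python-level loop over every data point on every objective evaluation.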
Folks,

Thanks to Alok Singhal and Robert Kern I have not only learned a great deal about SciPy and NumPy, but I have code that works. Thanks for the tip on not looping; it does make cleaner code. I have two issues: 1) there must be a better way to convert a list of data pairs to two arrays, 2) I'm not sure of a graceful way to transition from one plot to the next and then close.

It works, and I'm going to plug it into other code that grabs data from a MySQL database.

import numpy as np
from scipy import rand
from scipy.optimize import leastsq
import pylab

class HypObj:
    """Object for best fit to hyperbolic function"""
    def __init__(self, x, y, guess=None, plot=None):
        self.x = x
        self.y = y
        self.plot = plot
        if guess == None:
            self.guess = [y.max(), 1.0, 0.5]
        else:
            self.guess = guess
        self.parameters, self.success = leastsq(self._errf, self.guess[:],
                                                args=(self.x, self.y))
        self.r2 = self._r2(self.x, self.y, self.parameters)
        if self.plot <> None:
            pylab.plot(self.x, self.y, 'ro',
                       self.x, self._f(self.parameters, self.x), 'b--')
            pylab.show()
            pylab.clf()

    def _f(self, p, x):
        """Evaluate function with parameters, p, and array of x values"""
        return p[0]*(1.0+p[1]*p[2]*x)**(-1.0/p[1])

    def _errf(self, p, x, y):
        """return the difference between calculated and input y values"""
        return self._f(p, x) - y

    def _r2(self, x, y, abc):
        """calculate the correlation coefficient and rmsd"""
        y_calc = self._f(abc, self.x)
        y_avg = y.mean()
        rmsd = ((y-self._f(abc, x))**2).sum()/((y_calc-y_avg)**2).sum()
        return (1.0-rmsd, rmsd)

def list_toarray(data):
    """Convert input list of data pairs to two arrays"""
    x = []
    y = []
    for i in data:
        x.append(float(i[0]))
        y.append(float(i[1]))
    x = np.array(x)
    y = np.array(y)
    return x, y

def test():
    def f(p, x):
        # for testing, make the objective function available
        return p[0]*(1.0+p[1]*p[2]*x)**(-1.0/p[1])

    # fake data for testing, should get perfect correlation
    print "\nTest 1: should find an exact solution = [1.25,0.75,0.25]"
    print "(No plot)"
    qi, b, di = [1.25, 0.75, 0.25]
    sample = []
    for i in range(1, 61):
        sample.append([float(i), qi*(1.0+b*di*float(i))**(-1.0/b)])
    x, y = list_toarray(sample)
    ex1 = HypObj(x, y)
    if ex1.success == 1:
        print ex1.parameters, ex1.r2
    else:
        print "leastsq() error: ", ex1.success

    # generate data with random deviations
    print "\nTest 2: should find a solution close to [15.0, 2.5, 0.3]"
    print "(No plot)"
    n = 151
    x = np.mgrid[1:10:n*1j]
    abc = [15.0, 2.5, 0.3]
    y = f(abc, x) + rand(n) - 0.5
    ex2 = HypObj(x, y, guess=[30.0, 4.0, 1.0])
    if ex2.success == 1:
        print ex2.parameters, ex2.r2
    else:
        print "leastsq() error: ", ex2.success

    # real data
    print "\nTest 3: real world data"
    rn115604 = [[1, 3233], [2, 3530], [3, 3152], [4, 2088], [6, 3038],
                [7, 2108], [8, 2132], [9, 1654], [10, 1762], [11, 1967],
                [12, 1760], [13, 1649], [14, 1633], [15, 1680], [16, 1398],
                [17, 1622], [18, 1393], [19, 1436], [20, 1352], [21, 1402],
                [22, 1459], [23, 1373], [24, 1262], [25, 1346], [26, 1325],
                [27, 1319], [28, 1309], [29, 1206], [30, 1249], [31, 1446],
                [32, 1255], [33, 1227], [34, 1268], [35, 1233], [36, 1175],
                [37, 1129], [38, 1242], [39, 1247], [40, 1198], [41, 1058],
                [42, 1172], [43, 1242], [44, 1214], [45, 1148], [46, 1689],
                [47, 971], [48, 1084], [49, 1028], [50, 1164], [51, 1297],
                [52, 1040], [53, 1045], [54, 1196], [55, 991], [56, 1065],
                [57, 898], [58, 1020], [59, 966], [60, 1162], [61, 1069],
                [62, 1055], [63, 1035], [64, 1045], [65, 1076], [66, 1108],
                [67, 918], [68, 1051], [69, 1049], [70, 1039], [71, 1133],
                [72, 887], [73, 924], [74, 983], [75, 1077], [76, 1092],
                [77, 973], [78, 920], [79, 1040]]
    x, y = list_toarray(rn115604)
    ex3 = HypObj(x, y, plot=1)
    if ex3.success == 1:
        print ex3.parameters, ex3.r2
    else:
        print "leastsq() error: ", ex3.success

if __name__ == "__main__":
    test()

Brandon C. Nuttall
Brandon Nuttall wrote:
Folks,
Thanks to Alok Singhal and Robert Kern I have not only learned a great deal about SciPy and NumPy, but I have code that works. Thanks for the tip on not looping; it does make cleaner code. I have two issues: 1) there must be a better way to convert a list of data pairs to two arrays,
xy = array([[x0, y0],
            [x1, y1],
            ...])
x = xy[:,0]
y = xy[:,1]

--
Robert Kern
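A concrete, runnable version of this column-slicing idea (the sample pairs are the first few rows of the real data earlier in the thread), including a transpose-unpacking variant that does the same thing:

```python
import numpy as np

pairs = [[1, 3233], [2, 3530], [3, 3152], [4, 2088]]
xy = np.asarray(pairs, dtype=float)

x = xy[:, 0]          # first column
y = xy[:, 1]          # second column

# Equivalent: unpack the rows of the transpose in one line.
x2, y2 = xy.T
assert (x == x2).all() and (y == y2).all()
```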
On 28/02/07: 15:51, Brandon Nuttall wrote:
import numpy as np from scipy import rand from scipy.optimize import leastsq import pylab
class HypObj:
    """Object for best fit to hyperbolic function"""
    def __init__(self, x, y, guess=None, plot=None):
        self.x = x
        self.y = y
        self.plot = plot
        if guess == None:
            self.guess = [y.max(), 1.0, 0.5]
        else:
            self.guess = guess
        self.parameters, self.success = leastsq(self._errf, self.guess[:],
                                                args=(self.x, self.y))
        self.r2 = self._r2(self.x, self.y, self.parameters)
        if self.plot <> None:
Maybe

    if self.plot is not None:

is better? See http://www.python.org/dev/peps/pep-0008/, section 'Programming Recommendations'.
def list_toarray(data):
    """Convert input list of data pairs to two arrays"""
    x = []
    y = []
    for i in data:
        x.append(float(i[0]))
        y.append(float(i[1]))
    x = np.array(x)
    y = np.array(y)
    return x, y
For your data, you can do:

x = [data[i][0] for i in range(60)]
y = [data[i][1] for i in range(60)]

If you want to use numpy, then you could do:

data = np.asarray(data)
x = data[:, 0]
y = data[:, 1]
def test():
def f(p, x):
    # for testing, make the objective function available
    return p[0]*(1.0+p[1]*p[2]*x)**(-1.0/p[1])

# fake data for testing, should get perfect correlation
print "\nTest 1: should find an exact solution = [1.25,0.75,0.25]"
print "(No plot)"
qi, b, di = [1.25, 0.75, 0.25]
sample = []
for i in range(1, 61):
    sample.append([float(i), qi*(1.0+b*di*float(i))**(-1.0/b)])
You don't need the loop:

sample = np.zeros((60, 2), dtype=float)
sample[:, 0] = np.arange(60) + 1
sample[:, 1] = qi*(1. + b*di*sample[:, 0])**(-1.0/b)

Given that you 'unpack' sample at a later stage anyway, you could just as well use two different variables instead of sample.

Cheers,
Alok
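The vectorized construction can be checked directly against the loop it replaces:

```python
import numpy as np

qi, b, di = 1.25, 0.75, 0.25

# Vectorized construction of the (x, y) sample array.
sample = np.zeros((60, 2), dtype=float)
sample[:, 0] = np.arange(60) + 1
sample[:, 1] = qi * (1.0 + b * di * sample[:, 0]) ** (-1.0 / b)

# Loop construction, as in the original test().
expected = [[float(i), qi * (1.0 + b * di * float(i)) ** (-1.0 / b)]
            for i in range(1, 61)]
assert np.allclose(sample, np.array(expected))
```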
On 2/28/07, Brandon Nuttall <bnuttall@uky.edu> wrote:
Folks,
Thanks to Alok Singhal and Robert Kern I have not only learned a great deal about SciPy and NumPy, but I have code that works. Thanks for the tip on not looping; it does make cleaner code. I have two issues: 1) there must be a better way to convert a list of data pairs to two arrays, 2) I'm not sure of a graceful way to transition from one plot to the next and then close.
To add to the cacophony of coding and style suggestions: the <> operator is likely to be removed in the future, so you should use '!=' instead.

My personal preference would be to move the plotting functionality into a method, so that initializing the data is separate from acting on the data. I find this a helpful distinction, as I usually don't want to plot at construction time, but your mileage may vary.

Lastly, you can wrap the non-plotting portion of the test function into a doctest (http://www.python.org/doc/lib/module-doctest.html), which would then serve both as an example in the code and as a correctness test that is easy to re-run when other things change, like upgrades to numpy and scipy. I find this immensely helpful.

Have fun!
Alex
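A minimal sketch of the doctest idea for the non-plotting part; the function name and example values here are illustrative, not from the posted code:

```python
def hyperbolic(x, a, b, c):
    """Hyperbolic decline curve value at x.

    At x = 0 the curve equals a:

    >>> hyperbolic(0.0, 15.0, 2.5, 0.3)
    15.0
    """
    return a * (1.0 + b * c * x) ** (-1.0 / b)

if __name__ == "__main__":
    import doctest
    doctest.testmod()   # re-run after numpy/scipy upgrades to catch regressions
```

Running the module directly then executes the examples embedded in the docstrings and reports any mismatch.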
participants (4)
- Alexander Michael
- Alok Singhal
- Brandon Nuttall
- Robert Kern