[SciPy-User] ancova with optimize.curve_fit

Tue Dec 7 16:58:41 EST 2010

On 12/06/2010 10:00 PM, Peter Tittmann wrote:
>
> Gentlemen,
>
> I've decided to switch to the OLS method, thought I did get the NNLS 
> method that Skipper proposed working. I was not prepared to spend more 
> time trying to make sense of the resulting array for ancova, etc. 
> (also could not figure out how to interpret the resulting coefficient 
> array as I was expecting a 2d array representing the a and b 
> coefficient values but it returned a 1d array). I have hopefully 
> simple follow up questions:
>
> 1. Is there a method to define explicitly the function used in OLS? I 
> know numpy.linalg.lstsq is the way OLS works but is there another 
> function where I can define the form?
>
> 2. I'm still interested in interpreting the results of the NNLS 
> method, so if either of you can suggest what the resulting arrays mean 
> id be grateful. I've attached the output of NNLS
>
> warm regards,
>
> Peter
>
>
> Here is the working version of NNLS:
>
> def getDiam2(ht,*b):
>     return b[0] * ht[:,1]**b[1] + np.sum(b[2:]*ht[:,2:], axis=1)
>
> dt = np.genfromtxt('/home/peter/Desktop/db_out.csv', delimiter=",", 
> names=True, dtype=None)
> indHtPlot = adt['height']
> depDbh = adt['dbh']
> plot_dummies, col_map = sm.tools.categorical(dt['plot], drop=True, 
> dictnames=True)
>
>
> def nnlsDummies():
>     '''this function returns coefficients and covariance arrays'''
>     plot_dummies, col_map = sm.tools.categorical(indPlot, drop=True, 
> dictnames=True)
>     X = np.column_stack((indHt, plot_dummies))
>     Y = depDbh
>     coefs, cov = curve_fit(getDiam2, X, Y, p0= [0.]*X.shape[1])
>     return coefs, cov
>
>
> -- 
> Peter Tittmann
>
>
> On Monday, December 6, 2010 at 4:55 PM, josef.pktd at gmail.com wrote:
>
>> On Mon, Dec 6, 2010 at 7:41 PM, Skipper Seabold <jsseabold at gmail.com 
>> <mailto:jsseabold at gmail.com>> wrote:
>>> On Mon, Dec 6, 2010 at 7:31 PM, Peter Tittmann wrote:
>>> > thanks both of you,
>>> > Josef, the data that I sent is only the first 100 rows of about 
>>> 1500, there
>>> > should be sufficient sampling in each plot.
>>> > Skipper, I have attempted to deploy your suggestion for not 
>>> linearizing the
>>> > data. It seems to work. I'm a little confused at your modification 
>>> if the
>>> > getDiam function and I wonder if you could help me understand. The 
>>> form of
>>> > the equation that is being fit is:
>>> > Y= a*X^b
>>> > your version of the detDaim function:
>>> >
>>> > def getDiam(ht, *b):
>>> >    return ht[:,0]**b[0] + np.sum(b[1:]*ht[:,1:], axis=1)
>>> >
>>> > Im sorry if this is an obvious question but I don't understand how 
>>> this
>>> > works as it seems that the "a" coefficient is missing.
>>> > Thanks again!
>>>
>>> Right.  I took out the 'a', because as I read it when I linearized (I
>>> might be misunderstanding ancova, I never recall the details), if you
>>> include 'a' and also all of the dummy variables for the plot, then you
>>> will have a the problem of multicollinearity.  You could also include
>>> 'a' and drop one of the plot dummies, but then 'a' is just your
>>> reference category that you dropped.  So now b[0] is the nonlinear
>>> effect of your main variable and b[1:] contains linear shift effects
>>> of all the plots.  Hmm, thinking about it some more, though I think
>>> you could include 'a' in the non-linear version above (call it b[0]
>>> and shift everything else over by one), because now 'a' would be the
>>> effect when the current b[0] is zero.  I was just unsure how you meant
>>> 'a' when you had a*ht**b and were trying to include in ht the plot
>>> variable dummies.
>>
>> As I understand it, the intention is to estimate equality of the slope
>> coefficients, so the continuous variable is multiplied with the dummy
>> variables. In this case, the constant should still be added. The
>> normalization question is whether to include all dummy-cont.variable
>> products and drop the continuous variable, or include the continuous
>> variable and drop one of the dummy-cont levels.
>>
>> Unless there is a strong reason to avoid log-normality of errors, I
>> would work (first) with the linear version.
>>
>> Josef
>>
>>
>>>
>>> Skipper
>>> _______________________________________________
>>> SciPy-User mailing list
>>> SciPy-User at scipy.org <mailto:SciPy-User at scipy.org>
>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User at scipy.org <mailto:SciPy-User at scipy.org>
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
I do think this is starting to be an off-list discussion because this is 
really about statistics and not about numpy/scipy (you can contact me 
off-list if you want).

I am not sure what all the variables are so please excuse me but I 
presume you want to model dbh as a function of height, plot and species. 
Following usual biostatistics interpretation, 'plot' is probably treated 
as random effect but you probably have to use R/SAS etc for that for 
both linear and nonlinear models or some spatial models.

Really you need to determine whether or not a nonlinear model is 
required. With the few data points you provided, I only see a linear 
relationship between dbh and height with some outliers and perhaps some 
heterogeneity of variance. Often doing a simple polynomial/spline can 
help to see if there is any evidence for a nonlinear relationship in the 
full data - a linear model or polynomial with the data provided does not 
suggest a nonlinear model. Obviously a linear model is easier to fit and 
interpret especially if you create the design matrix as estimable 
functions (which is rather trivial once you understand using dummy 
variables).

The most general nonlinear/multilevel model proposed is of the form:
dbh= C + A*height^B
Obviously if B=1 then it is a linear model and the parameters A, B and C 
can be modeled with a linear function of intercept, plot and species. 
Although, if 'plot' is what I think it is then you probably would not 
model the parameters A and B with it.

Without C you are forcing the curve through zero which is biological 
feasible if you expect dbh=0 when height is zero. However, dbh can be 
zero if height is not zero just due to the model itself or what dbh 
actually is (it may take a minimum height before dbh is greater than 
zero). With the data you provided, there are noticeable differences 
between species for dbh and height so you probably need C in your model.

For this general model you probably should just fit the curve for each 
species alone but I would use a general stats package to do this. This 
will give you a good starting point to know how well the curve fits each 
species as well as the similarity of parameters and residual variation. 
Getting convergence with a model that has B varying across species may 
be rather hard so I would suggest modeling A and C first.

Bruce

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20101207/36dbf5c3/attachment.html>