[SciPy-User] adding linear fitting routine

Matt Newville newville at cars.uchicago.edu
Wed Dec 4 13:24:12 EST 2013


Hi David, Josef,

On Wed, Dec 4, 2013 at 12:15 PM,  <josef.pktd at gmail.com> wrote:
> On Wed, Dec 4, 2013 at 12:53 PM, David J Pine <djpine at gmail.com> wrote:
>> Josef,
>>
>> Actually, I just rechecked curve_fit and it returns only the optimal fitting
>> parameters and the covariance matrix.  I could pare down linfit so that it
>> returns only those quantities and leave it to the user to calculate
>> chi0squared and the residuals.  I suppose that's the cleanest way to go.
>>
>> David
>>
>>
>> On Wed, Dec 4, 2013 at 6:47 PM, David J Pine <djpine at gmail.com> wrote:
>>>
>>> Josef,
>>>
>>> Ok, so what would you propose?  That we essentially replace linregress
>>> with linfit, and then let people calculate std_err and pvalue themselves
>>> from the covariance matrix that `linfit` returns?  or something else?  By
>>> the way, that's what I chose to do for the estimates of the uncertainties in
>>> the fitting parameters--to let the user calculate the uncertainties in the
>>> fitting parameters from square roots the diagonal elements of the covariance
>>> matrix.  In my opinion, that results in a cleaner less cluttered function.
>
> Please reply inline so we have the sub-threads together.
>
> two thoughts:
>
> - I'm getting more and more averse to functions that return "numbers"
>   scipy.optimize minimize returns a dictionary
>   In statsmodels we return a special class instance, that can
> calculate lazily all the extra things a user might want.
>   And were we don't do that yet like in hypothesis test, we want to
> change it soon.
>   The two main problems with returning numbers are that it cannot be
> changed in a backwards compatible way, and, second, if we want to
> offer a user to calculate additional optional results, then we need
> return_this, return_that, return_something_else, ....
>
> - The main usage of stats.linregress that I have seen in random looks
> at various packages, is just to get quick fit of a line without (m)any
> extras. In this case just returning the parameters and maybe some
> other minimal cheap extras is fine.
>
>
> I don't know if we want linfit to provide a one-stop shopping center,
> or just to provide some minimal results and leave the rest to the
> user.
>
> (In statsmodels I also often don't know what I should do. I follow the
> scipy tradition and return some numbers, only to change my mind later
> when I see what additional results could be easily calculated within
> the function, but I don't get access to the required calculations.)
>
> Josef
>
>>>
>>> David
>>>
>>> David
>>>
>>>
>>> On Wed, Dec 4, 2013 at 6:03 PM, <josef.pktd at gmail.com> wrote:
>>>>
>>>> On Wed, Dec 4, 2013 at 11:29 AM, David Pine <djpine at gmail.com> wrote:
>>>> >
>>>> > On Dec 4, 2013, at 3:24 PM, josef.pktd at gmail.com wrote:
>>>> >
>>>> >
>>>> > linfit looks like an enhanced version of linregress, which also has
>>>> > only one regressor, but doesn't have weights
>>>> >
>>>> > http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.linregress.html
>>>> >
>>>> >
>>>> > The problem with this is that the statistical tests that linregress
>>>> > uses--the r-value, p-value, & stderr-- are not really compatible with
>>>> > the
>>>> > weighted chi-squared fitting that linfit does.  The  r-value, p-value,
>>>> > &
>>>> > stderr are statistical tests that are used mostly in the social
>>>> > sciences
>>>> > (see http://en.wikipedia.org/wiki/Coefficient_of_determination).
>>>> > Looking at
>>>> > linregress, it's clear that it was written with that community in mind.
>>>> >
>>>> > By contrast, linfit (and curve_fit) use the chi-squared measure of
>>>> > goodness
>>>> > of fit, which is explicitly made to be used with weighted data.  In my
>>>> > opinion, trying to satisfy the needs of both communities with one
>>>> > function
>>>> > will result in inefficient code and confusion in both user communities.
>>>> > linfit naturally goes with the curve_fit and polyfit functions, and is
>>>> > implemented consistent with those fitting routines.  linregress is
>>>> > really a
>>>> > different animal, with statistical tests normally used with unweighted
>>>> > data,
>>>> > and I suspect that the community that uses it will be put off by the
>>>> > "improvements" made by linfit.
>>>>
>>>> except for setting absolute_sigma to True or relsigma to False and
>>>> returning redchisq instead of rsquared, there is no real difference.
>>>> It's still just weighted least squares with fixed or estimated scale.
>>>> (In statsmodels we have most of the same statistics returned after WLS
>>>> as after OLS. However, allowing for a fixed scale is still not built
>>>> in.)
>>>>
>>>> You still return the cov of the parameter estimates, so users can
>>>> still calculate std_err and pvalue themselves in `linfit`.
>>>>
>>>> In my interpretation of the discussions around curve_fit, it seems to
>>>> me that it is now a version that both communities can use.
>>>> The only problem I see is that linfit/linregress get a bit ugly if
>>>> there are many optional returns.
>>>>
>>>> Josef
>>>>
>>>> >
>>>> >
>>>> > relsigma is similar to the new `absolute_sigma` in curve_fit
>>>> > https://github.com/scipy/scipy/pull/3098
>>>> >
>>>> >
>>>> > That's right.  linfit implements essentially the same functionality
>>>> > that is
>>>> > being implemented in curve_fit
>>>> >
>>>> >
>>>> >
>>>> > I think linregress could be rewritten to include these improvements.
>>>> >
>>>> > Otherwise I keep out of any fitting debates, because I think `odr` is
>>>> > better for handling measurement errors in the x variables, and
>>>> > statsmodels is better for everything else (mainly linear only so far)
>>>> > and `lmfit` for nonlinear LS.
>>>> > There might be a case for stripped down convenience functions or
>>>> > special case functions.
>>>> >
>>>> > Josef
>>>> >
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > SciPy-User mailing list
>>>> > SciPy-User at scipy.org
>>>> > http://mail.scipy.org/mailman/listinfo/scipy-user
>>>> >
>>>> _______________________________________________
>>>> SciPy-User mailing list
>>>> SciPy-User at scipy.org
>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>>
>>>
>>
>>
>> _______________________________________________
>> SciPy-User mailing list
>> SciPy-User at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-user
>>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>

I'd very much like to see having linfit() available in scipy.
Would it be reasonable to add David's linfit() as "the more complete"
version, and refactor linregress() to use linfit() and return it's
current return tuple derived from the linfit results?    Perhaps that
what Josef is suggesting too.  FWIW, I would generally prefer getting
a dictionary of results as a return value instead of a tuple with more
than 2 items.

--Matt Newville



More information about the SciPy-User mailing list