[Numpy-discussion] Covariance matrix from polyfit

David Schaich daschaich at gmail.com
Thu Jul 4 20:01:17 EDT 2013


Hi all,

I recently adopted python, and am in the process of replacing my old 
analysis tools. For simple (e.g., linear) interpolations and 
extrapolations, in the past I used gnuplot. Today I set up the 
equivalent with polyfit in numpy v1.7.1, first running a simple test to 
reproduce the gnuplot result.

A discussion on this list back in February alerted me that I should use 
1/sigma for the weights in polyfit as opposed to 1/sigma**2. Fine -- 
that's not what I'm used to, but I can make a note.
http://mail.scipy.org/pipermail/numpy-discussion/2013-February/065649.html
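
Concretely (a minimal sketch, using the per-point errors from the example at the bottom of this message):
+++
 >>> import numpy as np
 >>> sigma = np.array([0.000370, 0.000355, 0.000249])  # per-point errors
 >>> weight = 1. / sigma   # polyfit wants 1/sigma, not the 1/sigma**2 I was used to
+++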

Another issue mentioned in that thread is scaling the covariance matrix by
fac = resids / (len(x) - order - 2.0)
This wreaked havoc on the simple test I mentioned above (and included 
below), which fits three data points to a straight line. I spent hours 
trying to figure out why numpy returned negative variances, before 
tracking down this line. And, indeed, if I add a fake fourth data point, 
I end up with inf's and nan's.
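
To spell out the arithmetic (as far as I can tell, lib/polynomial.py uses order = deg + 1, so order = 2 for a straight line):
+++
 >>> deg = 1; order = deg + 1   # straight-line fit
 >>> 3 - order - 2.0            # denominator of fac with my three points
-1.0
 >>> 4 - order - 2.0            # denominator with the fake fourth point
0.0
+++
so the whole covariance matrix picks up the wrong sign with three points, and the division by zero produces the inf's and nan's with four.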

There is some lengthy justification for subtracting that 2 around line 
590 in lib/polynomial.py. Fine -- it's nothing I recall seeing before 
(and I removed it from my local installation), but I'm just a new user.

However, I do think it is important to fix polyfit so that it doesn't 
produce pathological results like those I encountered today. Here are a 
couple of possibilities that would let the subtraction of 2 remain:
* Check whether len(x) > order + 2 (see the sketch after this list), and if it is not, either
** Die with an error
** Scale by resids / (len(x) - order) instead of resids / (len(x) - order - 2.0)

* Don't bother with this scaling at all, leaving it to the users (who 
can subtract 2 if they want). This is what scipy.optimize.leastsq does, 
after what seems to be a good deal of discussion: "This matrix must be 
multiplied by the residual variance to get the covariance of the 
parameter estimates – see curve_fit."
http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.leastsq.html
https://github.com/scipy/scipy/pull/448
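
To illustrate the first possibility, here is a rough sketch using the variable names from lib/polynomial.py (not a tested patch):
+++
dof = len(x) - order - 2.0
if dof <= 0:
    # either die here, or fall back to resids / (len(x) - order)
    raise ValueError("cov=True needs more than order + 2 data points")
fac = resids / dof
+++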

I leave it to those of you with more numpy experience to decide what 
would be the best way to go.

Cheers,
David
http://www-hep.colorado.edu/~schaich/


+++
Here's the simple example:
 >>> import numpy as np
 >>> m = np.array([0.008, 0.01, 0.015])
 >>> dat = np.array([1.0822582, 1.0805417, 1.0766624])
 >>> weight = np.array([1/0.000370, 1/0.000355, 1/0.000249])
 >>> out, cov = np.polyfit(m, dat, 1, full=False, w=weight, cov=True)
 >>> print out, '\n', cov
[-0.79269957 1.08854252]
[[ -2.34965006e-04 2.84428412e-06]
[ 2.84428412e-06 -3.66283662e-08]]
 >>> print np.sqrt(-1. * cov[0][0])
0.0153285682895
 >>> print np.sqrt(-1. * cov[1][1])
0.000191385386578
+++

Gnuplot gives
+++
Final set of parameters            Asymptotic Standard Error
=======================            ==========================
A               = -0.792719        +/- 0.01533      (1.934%)
B               = 1.08854          +/- 0.0001914    (0.01758%)
+++
so up to the negative sign, all is good.

For its part, scipy.optimize.leastsq needs me to do the scaling:
+++
 >>> import numpy as np
 >>> from scipy import optimize
 >>> m = np.array([0.008, 0.01, 0.015])
 >>> dat = np.array([1.0822582, 1.0805417, 1.0766624])
 >>> err = np.array([0.000370, 0.000355, 0.000249])
 >>> linear = lambda p, x: p[0] * x + p[1]
 >>> errfunc = lambda p, x, y, err: (linear(p, x) - y) / err
 >>> p_in = [-1., 1.]
 >>> all_out = optimize.leastsq(errfunc, p_in[:], args=(m, dat, err), full_output=1)
 >>> out = all_out[0]
 >>> cov = all_out[1]
 >>> print out, '\n', cov
[-0.79269959 1.08854252]
[[ 3.40800756e-03 -4.12544212e-05]
[ -4.12544212e-05 5.31270007e-07]]
 >>> chiSq_dof = ((errfunc(out, m, dat, err))**2).sum() / (len(m) - len(out))
 >>> cov *= chiSq_dof
 >>> print cov
[[ 2.34964190e-04 -2.84427528e-06]
[ -2.84427528e-06 3.66282716e-08]]
+++


