# [Numpy-discussion] Covariance matrix from polyfit

David Schaich daschaich at gmail.com
Thu Jul 4 20:01:17 EDT 2013

```Hi all,

I recently adopted python, and am in the process of replacing my old
analysis tools. For simple (e.g., linear) interpolations and
extrapolations, in the past I used gnuplot. Today I set up the
equivalent with polyfit in numpy v1.7.1, first running a simple test to
reproduce the gnuplot result.

A discussion on this list back in February alerted me that I should use
1/sigma for the weights in polyfit as opposed to 1/sigma**2. Fine --
that's not what I'm used to, but I can make a note.
http://mail.scipy.org/pipermail/numpy-discussion/2013-February/065649.html

Another issue mentioned in that thread is scaling the covariance matrix by
fac = resids / (len(x) - order - 2.0)
This wreaked havoc on the simple test I mentioned above (and include
below), which fit three data points to a straight line. I spent hours
trying to figure out why numpy returned negative variances, before
tracking down this line. And, indeed, if I add a fake fourth data point,
I end up with inf's and nan's.

There is some lengthy justification for subtracting that 2 around line
590 in lib/polynomial.py. Fine -- it's nothing I recall seeing before
(and I removed it from my local installation), but I'm just a new user.

However, I do think it is important to fix polyfit so that it doesn't
produce pathological results like those I encountered today. Here are a
couple of possibilities that would let the subtraction of 2 remain:
* Check whether len(x) > order + 2, and if it is not, either
** Die with an error
** Scale by resids / (len(x) - order) instead of resids / (len(x) -
order - 2.0)

* Don't bother with this scaling at all, leaving it to the users (who
can subtract 2 if they want). This is what scipy.optimize.leastsq does,
after what seems to be a good deal of discussion: "This matrix must be
multiplied by the residual variance to get the covariance of the
parameter estimates – see curve_fit."
http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.leastsq.html
https://github.com/scipy/scipy/pull/448

I leave it to those of you with more numpy experience to decide what
would be the best way to go.

Cheers,
David

+++
Here's the simple example:
>>> import numpy as np
>>> m = np.array([0.008, 0.01, 0.015])
>>> dat = np.array([1.0822582, 1.0805417, 1.0766624])
>>> weight = np.array([1/0.000370, 1/0.000355, 1/0.000249])
>>> out, cov = np.polyfit(m, dat, 1, full=False, w=weight, cov=True)
>>> print out, '\n', cov
[-0.79269957 1.08854252]
[[ -2.34965006e-04 2.84428412e-06]
[ 2.84428412e-06 -3.66283662e-08]]
>>> print np.sqrt(-1. * cov[0][0])
0.0153285682895
>>> print np.sqrt(-1. * cov[1][1])
0.000191385386578
+++

Gnuplot gives
+++
Final set of parameters Asymptotic Standard Error
======================= ==========================
A = -0.792719 +/- 0.01533 (1.934%)
B = 1.08854 +/- 0.0001914 (0.01758%)
+++
so up to the negative sign, all is good.

For its part, scipy.optimize.leastsq needs me to do the scaling:
+++
>>> import numpy as np
>>> from scipy import optimize
>>> m = np.array([0.008, 0.01, 0.015])
>>> dat = np.array([1.0822582, 1.0805417, 1.0766624])
>>> err = np.array([0.000370, 0.000355, 0.000249])
>>> linear = lambda p, x: p[0] * x + p[1]
>>> errfunc = lambda p, x, y, err: (linear(p, x) - y) / err
>>> p_in = [-1., 1.]
>>> all_out = optimize.leastsq(errfunc, p_in[:], args=(m, dat, err),
full_output = 1)
>>> out = all_out[0]
>>> cov = all_out[1]
>>> print out, '\n', cov
[-0.79269959 1.08854252]
[[ 3.40800756e-03 -4.12544212e-05]
[ -4.12544212e-05 5.31270007e-07]]
>>> chiSq_dof = ((errfunc(out, m, dat, err))**2).sum() / (len(m) -
len(out))
>>> cov *= chiSq_dof
>>> print cov
[[ 2.34964190e-04 -2.84427528e-06]
[ -2.84427528e-06 3.66282716e-08]]
+++

```