On 26/03/2008, massimo sandal <massimo.sandal@unibo.it> wrote:
I am writing an algorithm that determines the best-fitting polynomial to a data set[1]. The algorithm tries to fit polynomials between 1st and 12th degree, and chooses the polynomial with the least mean square error.
It works really well now; however, when it happens to prefer high-degree polynomials (say, 8th-9th degree and above), I often see this warning:
/usr/lib/python2.5/site-packages/numpy/lib/polynomial.py:305: RankWarning: Polyfit may be poorly conditioned
  warnings.warn(msg, RankWarning)
I have little knowledge of polynomial fitting, but if I remember correctly this means that the fit is very sensitive to tiny shifts in the data values (i.e. unstable)? Should I worry, or can I go along with it as long as it gives me sensible results?
You should worry. That message indicates that the least-squares fitting is trying to solve a linear problem that is "ill-conditioned", that is, one whose solution depends very sensitively on the input. In combination with numerical errors, this can be a serious problem (though the fact that you're smoothing should reduce the issues).

There are two things going on here. First, high-order polynomials like to thrash wildly. This is basically because polynomials are very "stiff": they are very smooth, and every coefficient depends strongly on all the data. So if there's a kink at one place in your data (say), all your coefficients get bent out of shape trying to accommodate it. This can give polynomials extremely erratic behaviour even before the fit becomes ill-conditioned; it also means that the fit is going to become ill-conditioned quite soon.

The second issue is that representing polynomials as a0 + a1*x + a2*x**2 + ... + an*x**n is not necessarily the best approach. If, say, your x values span a narrow range far from zero, then getting a polynomial to fit will require very large coefficients whose contributions nearly cancel, and that delicate cancellation amplifies numerical errors. Numerical errors begin to be a problem surprisingly soon.

A partial solution is to write your polynomials differently: instead of representing them as a linear combination of 1, x, x**2, ..., x**n, it sometimes helps to represent them as a linear combination of a set of orthogonal polynomials (Chebyshev polynomials, for example). This won't help if the implementation of the orthogonal polynomials itself breaks down (as scipy's does), but in fact the Chebyshev polynomials have a nice clean representation as cos(n*arccos(x)) on [-1, 1] (IIRC; look it up before using!). I have found that using this sort of representation can drastically increase the degree of polynomial I can fit before numerics become a problem.
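For concreteness, here is a minimal sketch of a fit in the Chebyshev basis using the cos(n*arccos(x)) identity. The function names are mine, and numpy's `lstsq` stands in for whatever least-squares routine you prefer; treat it as an illustration, not a drop-in replacement for polyfit.

```python
import numpy as np

def cheb_design_matrix(x, degree):
    """Design matrix whose columns are Chebyshev polynomials T_0..T_degree,
    using the identity T_n(t) = cos(n * arccos(t)) for t in [-1, 1]."""
    # Rescale x into [-1, 1], where the Chebyshev identity holds.
    t = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    n = np.arange(degree + 1)
    return np.cos(n[np.newaxis, :] * np.arccos(t)[:, np.newaxis])

def chebfit_eval(x, y, degree):
    """Least-squares fit in the Chebyshev basis, evaluated at the same x's."""
    A = cheb_design_matrix(x, degree)
    coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    return A.dot(coeffs)

# Example: a 12th-degree fit that would already be shaky in the power basis.
x = np.linspace(0.0, 10.0, 200)
y = np.sin(x) + 0.1 * x
smooth = chebfit_eval(x, y, 12)
```

Note that the rescaling of x into [-1, 1] is part of the point: it is exactly the "scaled" in the scaled Chebyshev polynomials suggested below.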
You can't use polyfit any more, unfortunately (though at some point someone might write a version of the polynomial class that uses a different basis), but it's not too hard to write the least-squares fit yourself using scipy.linalg.lstsq (plus or minus a few vowels). Since you are computing polynomial coefficients (which is unstable) only to go back and evaluate the polynomial at the same points you fit, it should be possible to set things up so that the numerical instabilities are not really a problem: if some linear combination of basis polynomials gives nearly zero values, you might as well just take it to be zero. If you want to look into this, you might try implementing your own polynomial fitting using scipy.linalg.svd, which gives you control over how you deal with this situation.
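A sketch of that SVD idea, assuming a plain power basis and a made-up truncation threshold (`rcond` here is my choice, not a recommended value): directions whose singular values are tiny relative to the largest are simply dropped, i.e. taken to be zero.

```python
import numpy as np

def svd_polyfit_eval(x, y, degree, rcond=1e-10):
    """Polynomial least squares via SVD, discarding directions whose
    singular values are tiny relative to the largest one (near-zero
    combinations of the basis polynomials are taken to be zero)."""
    A = np.vander(x, degree + 1)           # columns x**degree, ..., x, 1
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > rcond * s[0]                # truncate ill-conditioned directions
    # Pseudo-inverse solution built from the retained singular values only.
    coeffs = Vt[keep].T.dot(U[:, keep].T.dot(y) / s[keep])
    return A.dot(coeffs)

# Example: recover a cubic exactly, where no truncation is needed.
x = np.linspace(-1.0, 1.0, 100)
y = x**3 - 0.5 * x
fitted = svd_polyfit_eval(x, y, 3)
```

The payoff comes at high degree: where polyfit would emit the RankWarning, the truncation quietly zeroes out the directions that the data cannot determine.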
[1] This is needed to "flatten" the data set, removing systematic, irregular low-frequency interference.
You might also think about whether you need to be subtracting polynomials at all. scipy's smoothing splines are extremely useful, and numerically much more stable than fitting polynomials (primarily because they're less affected by distant data points). You might also consider fitting a few sinusoids of prescribed frequencies, which may have easier-to-understand spectral properties.

I should mention that low-frequency "red" noise can pose serious problems for spectral analysis: if you're planning to do a Fourier transform later, strong low-frequency components may "leak" into your high-frequency bins, contaminating them. Deeter and Boynton (1982) discuss how to analyze data with this kind of contamination, in the context of pulsar timing.

All in all, I'd say your easiest solution is to use scipy's smoothing splines with a lot of smoothing. If you have to use polynomials, try using (scaled) Chebyshev polynomials as your basis. Only if both of these fail is it worth going into my more exotic suggestions.

Good luck,
Anne
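P.S. A minimal sketch of the smoothing-spline route; the data and the smoothing factor `s` below are made-up values for illustration, and `s` is the knob you would tune (larger means smoother):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Noisy data with a slow (low-frequency) drift to remove.
x = np.linspace(0.0, 10.0, 500)
drift = 0.5 * x + np.sin(0.3 * x)
y = drift + 0.05 * np.random.RandomState(0).randn(x.size)

# s is roughly the target sum of squared residuals; a common rule of
# thumb is s ~ (number of points) * (noise variance).
spline = UnivariateSpline(x, y, s=x.size * 0.05**2)

# Subtract the smooth trend to "flatten" the data set.
flattened = y - spline(x)
```

With a lot of smoothing the spline follows the slow interference but not the noise, which is exactly the flattening described in footnote [1].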