Polyfit may be poorly conditioned
Hi,

I am writing an algorithm that determines the best-fitting polynomial for a data set[1]. The algorithm tries to fit polynomials between 1st and 12th degree and chooses the polynomial with the least mean square error.

It works really well now; however, when it happens to prefer high-degree polynomials (say, 8th-9th degree and above), I often see this warning:

/usr/lib/python2.5/site-packages/numpy/lib/polynomial.py:305: RankWarning: Polyfit may be poorly conditioned
warnings.warn(msg, RankWarning)

I have little knowledge of polynomial fitting, but if I remember correctly this means that the fit is very sensitive to tiny shifts in the data values (i.e. unstable)? Should I worry anyway, or, as long as it gives me sensible results, just go along?

Thanks,
Massimo

[1] This is needed to "flatten" the data set, removing systematic, irregular low-frequency interference.

--
Massimo Sandal
University of Bologna
Department of Biochemistry "G.Moruzzi"
snail mail: Via Irnerio 48, 40126 Bologna, Italy
email: massimo.sandal@unibo.it
tel: +39-051-2094388 fax: +39-051-2094387
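A minimal sketch of the degree-selection loop described above, assuming `x` and `y` are numpy arrays of the data; `best_polynomial` and the bare mean-square-error criterion are illustrative, not the actual code:

```python
import numpy as np

def best_polynomial(x, y, max_deg=12):
    """Fit polynomials of degree 1..max_deg and keep the one
    with the smallest mean square error on the data."""
    best = None
    for deg in range(1, max_deg + 1):
        coeffs = np.polyfit(x, y, deg)   # may emit RankWarning at high degree
        mse = np.mean((y - np.polyval(coeffs, x)) ** 2)
        if best is None or mse < best[0]:
            best = (mse, deg, coeffs)
    return best  # (mse, degree, coefficients)
```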
On Wed, Mar 26, 2008 at 5:55 PM, massimo sandal <massimo.sandal@unibo.it> wrote:
Hi,

Well, think about it for a second: a fit with a 12th-order polynomial will always have a smaller mean square error than a fit using a <12th-order polynomial. You must figure out why before using polyfit.

Moreover, "smaller least mean square error" is not the right criterion in your case, because high-order polynomials very commonly show large oscillations in between the data points. That is not what you want.

I'm happy to see your question, because the worth behavior is to use a tool blindly.

Xavier
Xavier Gnata wrote:
Well, think about it for a second: a fit with a 12th-order polynomial will always have a smaller mean square error than a fit using a <12th-order polynomial. You must figure out why before using polyfit.
Well, this is reasonable but it is not what I see happening. Often the algorithm takes the highest-degree polynomial, but not always.
Moreover, "smaller least mean square error" is not the driver in our case because high order polynomials show (it is very common) large oscillations in between the data points. That is not what you want.
Yes, I know. However, it seems to behave extremely well in this case: I tested it on dozens of data sets and it flattens them all practically perfectly for my needs. That's why I asked: should I worry *even if it works*?
I'm happy to see your question because the worth behavior is to use a tool blindly.
You probably meant "worst", however yes, I agree :) m. -- Massimo Sandal University of Bologna Department of Biochemistry "G.Moruzzi" snail mail: Via Irnerio 48, 40126 Bologna, Italy email: massimo.sandal@unibo.it tel: +39-051-2094388 fax: +39-051-2094387
On Thu, Mar 27, 2008 at 12:57 PM, massimo sandal <massimo.sandal@unibo.it> wrote:
Xavier Gnata wrote:
Well, think about it for a second: a fit with a 12th-order polynomial will always have a smaller mean square error than a fit using a <12th-order polynomial. You must figure out why before using polyfit.
Well, this is reasonable but it is not what I see happening. Often the algorithm takes the highest-degree polynomial, but not always.
Hmm, we have a problem here: polyfit is a linear fit. It can (should?) be implemented using the pseudo-inverse, and the pseudo-inverse provides you with the solution that minimizes the chi². As a result, a 12th-order fit cannot be worse than a 5th-order one; it should at least end up with the same chi². One trivial example:

In [9]: polyfit(arange(1000), 2*arange(1000)+1, 30)
C:\Python25\lib\site-packages\numpy\lib\polynomial.py:306: RankWarning: Polyfit may be poorly conditioned
warnings.warn(msg, RankWarning)
Out[9]:
array([ 4.25295067e-93, -1.90319088e-89, 2.47204987e-86, 3.64986075e-84,
-2.04058052e-80, -1.02156478e-77, 1.38333948e-74, 1.84191780e-71,
-1.04858370e-69, -1.93210697e-65, -1.30453334e-62, 1.07062138e-59,
2.11173521e-56, 2.00842417e-54, -2.16681833e-50, -1.21709485e-47,
1.99280602e-44, 1.67639356e-41, -2.35550438e-38, -1.10826788e-35,
3.53529389e-32, -2.99768374e-29, 1.44456488e-26, -4.49042306e-24,
9.31940018e-22, -1.28887179e-19, 1.15776067e-17, -6.43505943e-16,
2.04266284e-14, 2.00000000e+00, 1.00000000e+00])

Of course, we also have numerical errors, but if they cannot be neglected, then you have found a problem which is really ill-conditioned (huge range of values? values very close to inf or 0?)... or there is a problem in polyfit, but I cannot see such a bug. Any comments?

Xavier
Moreover, "smaller least mean square error" is not the driver in our case because high order polynomials show (it is very common) large oscillations in between the data points. That is not what you want.
Yes, I know. However it seems to behave extremly well in this case -I tested on dozens of data sets and it flats them all practically perfectly for my needs. That's why I asked -should I worry *even if it works*?
I'm happy to see your question because the worth behavior is to use a tool blindly.
You probably meant "worst", however yes, I agree :)
m.
-- Massimo Sandal University of Bologna Department of Biochemistry "G.Moruzzi"
snail mail: Via Irnerio 48, 40126 Bologna, Italy
email: massimo.sandal@unibo.it
tel: +39-051-2094388 fax: +39-051-2094387
_______________________________________________ SciPy-user mailing list SciPy-user@scipy.org http://projects.scipy.org/mailman/listinfo/scipy-user
On 28/03/2008, Xavier Gnata <xavier.gnata@gmail.com> wrote:
What exactly is the problem here? This is the best fit that could be hoped for: the linear and constant terms are right, and the rest are zero to the best accuracy we can plausibly expect. If you are using a goodness-of-fit measure that takes into account the number of parameters fitted (Bayesian model comparison, say, or even a chi-squared that takes into account "degrees of freedom"), this should be reported as a much worse fit than using only a linear polynomial.

The problem is ill-conditioned because 1000**30 is so much bigger than 999**30, as is 1000**29 compared to 999**29, that the vectors arange(1000)**30 and arange(1000)**29 are nearly identical. As the polyfit docstring says, this is a really horrible basis to use for polynomial fitting.

Since polyfit is quite sensibly based on the SVD, when it encounters some linear combination of basis vectors that nearly cancels, rather than introduce enormous coefficients to try to use this combination for fitting, it just discards it. This results in slightly poorer fits sometimes, but drastically reduces the numerical headaches that come with ill-conditioned matrices. Thus even if a higher-degree polynomial would fit the data better in a world of exact arithmetic, occasionally numerical issues force the discarding of some troublesome combinations of coefficients.

I do think it would be useful to have some mechanism for fitting and working with polynomials represented in other bases, so that these numerical issues would be reduced. This could go through at the same time as an improvement (by specialization) of some of scipy's orthogonal polynomial functions. But for now I can see no real problem with polyfit.

Anne
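A quick way to see the ill-conditioning described here is to look at the Vandermonde matrix that the monomial basis produces (a sketch; the column comparison is illustrative):

```python
import numpy as np

x = np.arange(1000.0)
V = np.vander(x, 31)        # columns are x**30, x**29, ..., x**0
print(np.linalg.cond(V))    # astronomically large condition number

# the x**30 and x**29 columns point in almost the same direction
u, v = V[:, 0], V[:, 1]
cos_angle = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos_angle)            # very close to 1
```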
Anne Archibald wrote:
Anne: polyfit is fine. Quoting:

"Well, think about it for a second: a fit with a 12th-order polynomial will always have a smaller mean square error than a fit using a <12th-order polynomial. You must figure out why before using polyfit."

"Well, this is reasonable but it is not what I see happening. Often the algorithm takes the highest-degree polynomial, but not always."

This simply cannot be; polyfit is just fine, and the goal of my trivial example was to show that. My point is only to tell Massimo that there must be a problem somewhere in his code.

Xavier
Gnata Xavier wrote:
Quoting:

"Well, think about it for a second: a fit with a 12th-order polynomial will always have a smaller mean square error than a fit using a <12th-order polynomial. You must figure out why before using polyfit."

"Well, this is reasonable but it is not what I see happening. Often the algorithm takes the highest-degree polynomial, but not always."

This simply cannot be; polyfit is just fine, and the goal of my trivial example was to show that. My point is only to tell Massimo that there must be a problem somewhere in his code.
Pardon my utter ignorance, but imagine my data are just a straight line. How can a parabola, or a 3rd-order polynomial, fit them better than a straight line itself?

It is possible that when I pick the <12th-degree polynomials, the error is the same for the higher degrees. Since I use index() to find the "best" one in a vector of values, it probably picks the first of them.

Anyway my code, as silly as it is :), seems to work. I tested hundreds of data sets without a single failure. I'll make sure to use smoothed splines in the future, but it's not a priority now.

m.
massimo sandal wrote:
Pardon my utter ignorance, but imagine my data are just a straight line. How can a parabola, or a 3rd-order polynomial, fit them better than a straight line itself?
Just as it takes 2 points to define a line, it takes 3 points to define a parabola, and n+1 points to define an nth-order polynomial. Thus, you can define a polynomial that passes through all of your data points. This will produce an RMS error of 0. However, if you know the points lie along a line, this is clearly not optimal and you are simply fitting the curve to noise in the data.

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
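The point in a few lines of numpy (a toy sketch; the data and noise level are arbitrary): with 8 points, a degree-7 fit passes through every one of them, so its RMS error is essentially zero even though the line is the right model.

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0.0, 1.0, 8)
y = 2 * x + 1 + 0.1 * rng.randn(8)   # a line plus noise

for deg in (1, 7):
    coeffs = np.polyfit(x, y, deg)
    rms = np.sqrt(np.mean((y - np.polyval(coeffs, x)) ** 2))
    print(deg, rms)                  # degree 7 interpolates: rms near 0
```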
Ryan May wrote:

Just as it takes 2 points to define a line, it takes 3 points to define a parabola, and n+1 points to define an nth-order polynomial.
Of course :)
Thus, you can define a polynomial that passes through all of your data points. This will produce an RMS error of 0. However, if you know the points lie along a line, this is clearly not optimal and you are simply fitting the curve to noise in the data.
Yes. What I meant is that if the polynomial algorithm is trying to fit a straight line (let's forget about noise) with a high-order polynomial with strictly non-zero coefficients, I wondered whether it could be worse numerically, since the best fit requires zero coefficients for anything of order > 1. Of course, if we introduce noise, the polyfit will try to pass through the noise points, so its error couldn't be worse. I understand.

m.
On 26/03/2008, massimo sandal <massimo.sandal@unibo.it> wrote:
I am writing an algorithm that determines the best-fitting polynomial to a data set[1]. The algorithm tries to fit polynomials between 1st and 12th degree, and chooses the polynomial with the least mean square error.
It works really well now; however, when it happens to prefer high-degree polynomials (say, 8th-9th degree and above), I often see this warning:
/usr/lib/python2.5/site-packages/numpy/lib/polynomial.py:305: RankWarning: Polyfit may be poorly conditioned warnings.warn(msg, RankWarning)
I have little knowledge of polynomial fitting, but if I remember correctly this means that the fit is very sensitive to tiny shifts in the data values (i.e. unstable)? Should I worry anyway, or, as long as it gives me sensible results, just go along?
You should worry. That message indicates that the least-squares fitting is trying to solve a linear problem that is "ill-conditioned", that is, one for which the solution depends very sensitively on the input. In combination with numerical errors, this can be a serious problem (though the fact that you're smoothing should reduce the issues). There are two things going on here.

First, high-order polynomials like to thrash wildly. This is basically because polynomials are very "stiff": they are very smooth, and every coefficient depends strongly on all the data. So if there's a kink at one place in your data (say), all your coefficients get bent out of shape trying to accommodate it. This can give polynomials extremely erratic behaviour even before the fit becomes ill-conditioned; it also means that the polynomials are going to become ill-conditioned quite soon.

The second issue is that representing polynomials as a0 + a1*x + a2*x**2 + ... + an*x**n is not necessarily the best approach. If, say, your x values lie between -1 and +1, and your data lie between -1 and +1, then getting a polynomial to fit will require very large coefficients which lead to delicate cancellation. Numerical errors begin to be a problem surprisingly soon.

A partial solution is to write your polynomials differently: instead of representing them as a linear combination of 1, x, x**2, ..., x**n, it sometimes helps to represent them as a linear combination of a set of orthogonal polynomials (Chebyshev polynomials, for example). This won't help if, as in scipy's case, the implementation of orthogonal polynomials breaks down, but in fact the Chebyshev polynomials have a nice clean representation as cos(n*arccos(x)) (IIRC; look it up before using!). I have found that using this sort of representation can drastically increase the degree of polynomial I can fit before numerics become a problem. You can't use polyfit any more, unfortunately (though at some point someone might write a version of the polynomial class that uses a different basis), but it's not too hard to write the least-squares fitting using scipy.linalg.lstsq (plus or minus a few vowels).

Since you are computing polynomial coefficients (which is unstable) only to go back and compute the polynomial values at the same points you fit, it should be possible to set things up so that the numerical instabilities are not really a problem: if some linear combination of polynomials gives nearly zero values, then you might as well just take it to be zero. If you want to look into this, you might try implementing your own polynomial fitting using scipy.linalg.svd, which will give you control over how you deal with this situation.
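A sketch of the kind of fit suggested here, using cos(n·arccos(t)) as the Chebyshev basis and scipy.linalg.lstsq; `chebyshev_fit` is a hypothetical helper, not an existing scipy function:

```python
import numpy as np
from scipy.linalg import lstsq

def chebyshev_fit(x, y, deg):
    # rescale x onto [-1, 1], where the Chebyshev basis lives
    t = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    # design matrix: T_n(t) = cos(n * arccos(t)), n = 0..deg
    A = np.cos(np.arange(deg + 1)[np.newaxis, :] * np.arccos(t)[:, np.newaxis])
    coeffs, _, _, _ = lstsq(A, y)
    return coeffs, np.dot(A, coeffs)   # coefficients and fitted values
```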
[1] This is needed to "flatten" the data set, removing systematic, irregular low-frequency interference.
You might also think about whether you need to be subtracting polynomials. scipy's smoothing splines are extremely useful, and much more stable, numerically, than fitting polynomials. (This is primarily because they're less affected by distant data points.) You might also consider fitting a few sinusoids of prescribed frequencies (this may have easier-to-understand spectral properties).

I should mention that low-frequency "red" noise can pose serious problems for spectral analysis: if you're planning to do a Fourier transform later, it's possible that strong low-frequency components may "leak" into your high-frequency bins, contaminating them. Deeter and Boynton 1982 talk about how to analyze data with this kind of contamination, in the context of pulsar timing.

In all I'd say your easiest solution is to use scipy's smoothing splines with a lot of smoothing. If you have to use polynomials, try using (scaled) Chebyshev polynomials as your basis. Only if these both fail is it worth going into some of my more exotic solutions.

Good luck,
Anne
Anne Archibald wrote:
You should worry. That message indicates that the least-squares fitting is trying to solve a linear problem that is "ill-conditioned", that is, for which the solution depends very sensitively on the input. In combination with numerical errors, this can be a serious problem (though the fact that you're smoothing should reduce the issues). There are two things going on here.
First, high-order polynomials like to thrash wildly. This is basically because polynomials are very "stiff": they are very smooth, and every coefficient depends strongly on all the data. So if there's a kink at one place in your data (say), all your coefficients get bent out of shape trying to accommodate it. This can give polynomials extremely erratic behaviour even before the fit becomes ill-conditioned; it also means that the polynomials are going to become ill-conditioned quite soon.
I know and understand that polynomial behaviour. Just to let you understand better what I'm trying to do: the data I'm fitting are basically a straight line, with noise, with a gentle oscillation superimposed. What I want is to remove the oscillation (a well-known instrumental artefact). The fit of this line of practically pure noise acts as a "reference" for the true data set, which contains the oscillation AND the signal superimposed. So what I do is fit the reference data and subtract the fit from the actual data.

(If you know about AFM force spectroscopy, the "reference" is the approaching curve, the data is the retracting curve, and the oscillation is optical interference.)

Since the noise amplitude is very small compared to the large oscillation, spikes in the noise practically do not affect the fit. Of course, rare occurrences may happen where the fit goes wild, but I still haven't seen that, and a rare occurrence of that, say 1%, is OK for now.
The second issue is that representing polynomials as a0 + a1*x + a2*x**2 + ... + an*x**n is not necessarily the best approach. If, say, your x value lies between -1 and +1, and your data lies between -1 and +1, then getting a polynomial to fit will require very large coefficients which lead to delicate cancellation. Numerical errors begin to be a problem surprisingly soon. A partial solution is to write your polynomials differently: instead of representing them as a linear combination of 1, x, x**2, ..., x**n, it sometimes helps to represent them as a linear combination of a set of orthogonal polynomials (Chebyshev polynomials, for example). This won't help if, like scipy, the implementation of orthogonal polynomials breaks down, but in fact the Chebyshev polynomials have a nice clean representation as cos(n*arccos(x)) (IIRC; look it up before using!). I have found that using this sort of representation can drastically increase the degree of polynomial I can fit before numerics become a problem. You can't use polyfit any more, unfortunately (though at some point someone might write a version of the polynomial class that used a different basis), but it's not too hard to write the least-squares fitting using scipy.linalg.lstsq (plus or minus a few vowels).
Thanks for the tip! I'll look into it if I meet problems with raw polynomials.
Since you are computing polynomial coefficients (which is unstable) only to go back and compute the polynomial values at the same points you fit, it should be possible to set things up so the numerical instabilities are not really a problem - if some linear combination of polynomials gives nearly zero values, then you might as well just take it to be zero. If you want to look into this, you might try implementing your own polynomial fitting using the scipy.linalg.svd, which will give you control over how you deal with this situation.
Thanks for this tip, again.
[1] This is needed to "flatten" the data set, removing systematic, irregular low-frequency interference.
You might also think about whether you need to be subtracting polynomials. scipy's smoothing splines are extremely useful, and much more stable, numerically, than fitting polynomials. (This is primarily because they're less affected by distant data points.) You might also consider fitting a few sinusoids of prescribed frequencies (this may have easier-to-understand spectral properties).
Fitting sinusoids is something I've thought about, but having the easy polynomial fit at hand, I tried that, and it seems to work, apart from the annoying warning. As for splines, I've not tried those. I asked people before about splines, and they told me that they are even more unstable than polynomials(!), so I didn't even try. Again, thank you for telling me about them.
I should mention that low-frequency "red" noise can pose serious problems for spectral analysis - if you're planning to do a Fourier transform later it's possible that strong low-frequency components may "leak" into your high frequency bins, contaminating them. Deeter and Boynton 1982 talk about how to analyze data with this kind of contamination, in the context of pulsar timing.
Spectral analysis should not be needed, but this is surely an interesting read.
In all I'd say your easiest solution is to use scipy's smoothing splines with a lot of smoothing. If you have to use polynomials, try using (scaled) Chebyshev polynomials as your basis. Only if these both fail is it worth going into some of my more exotic solutions.
Thanks a lot! I'll look more deeply if I actually run into trouble with polynomials, and I will surely think about your advice. I learned a lot!

m.
Hi,

To further Anne's reply, you really should at least rescale your explanatory variable by dividing x by some 'large' constant [instead of a1 + a2*x + a3*x**2, first divide x by, say, 100, so you effectively fit a1 + a2*(x/100) + a3*(x/100)**2]. This 'trick' also works well for nonlinear models that would not or tend not to converge otherwise, and it really does not change anything but the scale.

The next step is using orthogonal polynomials, as these have rather nice properties and usually work well, as Anne indicated.

Splines and loess transformations are extremely flexible, and you really should try these, especially given the polynomial degree you are trying to fit. Also, these are semi-parametric, so the model is not as important as with polynomials, and they provide a 'local fit'. The latter means that you can have a different fit depending on where you are in the data. However, after saying that, much depends on the data and the information content.

I would also suggest somewhat strongly that you use a better model comparison procedure than some mean square error, such as BIC (Bayesian information criterion) or the Akaike information criterion. That way you can address model complexity: models with higher polynomial terms will fit the data better than models with lower polynomial terms.

Regards,
Bruce
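A sketch combining the two suggestions above, rescaling x before fitting and using BIC instead of raw mean square error; `best_degree_by_bic` is a hypothetical helper, and the BIC formula assumes Gaussian errors:

```python
import numpy as np

def best_degree_by_bic(x, y, max_deg=12):
    t = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0   # rescale to [-1, 1]
    n = len(y)
    best = None
    for deg in range(1, max_deg + 1):
        coeffs = np.polyfit(t, y, deg)
        rss = np.sum((y - np.polyval(coeffs, t)) ** 2)
        k = deg + 1                                # number of fitted coefficients
        bic = n * np.log(rss / n) + k * np.log(n)  # penalizes extra terms
        if best is None or bic < best[0]:
            best = (bic, deg, coeffs)
    return best  # (bic, degree, coefficients)
```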
Although I don't do much fitting, I do a lot with optics, and if the background signal you are trying to remove arises from interference fringes, then I would fit with a low number of periodic functions. In fact my procedure would be as follows:

- Do an FFT on the data and create a power spectrum.
- Pick out the 1-3 spectral components that describe the fringe pattern.
- Either do a global fit with the frequency components held fixed, or fit piece-wise, allowing the frequency, phase, and amplitude to vary within some restricted range for each piece.

Cheers,
Chris
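A sketch of the first two steps of this procedure; `dominant_frequencies` is a hypothetical helper, assumes uniformly sampled data, and uses np.fft.rfftfreq, which is present in recent numpy:

```python
import numpy as np

def dominant_frequencies(y, dt, n_components=3):
    """Return the n_components strongest frequencies in the power spectrum."""
    y = y - np.mean(y)                    # drop the DC component
    power = np.abs(np.fft.rfft(y)) ** 2   # one-sided power spectrum
    freqs = np.fft.rfftfreq(len(y), dt)
    strongest = np.argsort(power)[::-1][:n_components]
    return freqs[strongest]
```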
C.J.Lee@tnw.utwente.nl wrote:
Although I don't do much fitting, I do do a lot with optics and if the background signal you are trying to remove arises from interference fringes then I would fit with a low number of periodic functions.
In fact my procedure would be as follows: do an FFT on the data and create a power spectrum; pick out the 1-3 spectral components that describe the fringe pattern; either do a global fit with the frequency components held fixed, or fit piece-wise, allowing the frequency, phase, and amplitude to vary within some restricted range for each piece.
I tried something like that (it was my first attempt, in fact), but it does not work as smoothly as one might expect. The periodicity is not always perfect, either. The polynomial is faster, much easier to implement, and works better.

By the way, I have now tried it on more than 400 data curves, from different experiments. It seems to always work as expected, despite the (justified) concerns. I'll look into splines, but for now it is quite usable this way.

m.
On 27/03/2008, massimo sandal <massimo.sandal@unibo.it> wrote:
Anne Archibald wrote:
Since you are computing polynomial coefficients (which is unstable) only to go back and compute the polynomial values at the same points you fit, it should be possible to set things up so the numerical instabilities are not really a problem - if some linear combination of polynomials gives nearly zero values, then you might as well just take it to be zero. If you want to look into this, you might try implementing your own polynomial fitting using the scipy.linalg.svd, which will give you control over how you deal with this situation.
Thanks for this tip, again.
Ah, I should have read the docstring on polyfit. In fact it works in just this way, so you can be reasonably confident that it will do the right thing. If you want to suppress the warning you're seeing, try using "full=True" in your call to polyfit. This will return a number of values you don't care about, but will suppress the warning (because "rank" and "rcond" contain that information). The docstring also advises that subtracting the mean of your data before starting is a good idea.
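What the full=True call looks like (a sketch; the toy data roughly follow Xavier's earlier example):

```python
import numpy as np

x = np.arange(1000.0)
x = x - x.mean()   # subtract the mean, as the docstring advises
y = 2 * x + 1
coeffs, residuals, rank, sv, rcond = np.polyfit(x, y, 12, full=True)
# with full=True no RankWarning is emitted; instead, rank < 13 would show
# that polyfit discarded some nearly-degenerate basis combinations
print(rank, rcond)
```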
As for splines, I've not tried those. I asked people before about splines, and they told me that they are even more unstable than polynomials(!), so I didn't even try. Again, thank you for telling me about them.
I think they must be talking about splines without smoothing. The most usual kind of spline you see is forced to go through every data point, which is clearly not going to work for noisy data. But with some cleverness, implemented in scipy's splrep, you can find splines that are a least-squares fit in a particular sense (something like the straightest curve passing within one sigma of your data). In my experience they are extremely well-behaved.

It sounds like the answer to your original question should have been "go right ahead, you're fine". For what you're doing, polyfit should be quite robust, and the warning should not be a problem.

Good luck,
Anne
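A sketch of the smoothing-spline baseline removal recommended here; the synthetic data (line + gentle oscillation + noise) and the smoothing factor are illustrative:

```python
import numpy as np
from scipy.interpolate import splrep, splev

x = np.linspace(0.0, 10.0, 500)
y = 0.5 * x + 0.3 * np.sin(0.8 * x) + 0.05 * np.random.randn(500)

# s sets the target sum of squared residuals; m * sigma**2 is a common
# starting point when the noise level sigma is roughly known
tck = splrep(x, y, s=len(x) * 0.05 ** 2)
baseline = splev(x, tck)
flattened = y - baseline   # data with the slow oscillation removed
```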
participants (7)

- Anne Archibald
- Bruce Southey
- C.J.Lee@tnw.utwente.nl
- Gnata Xavier
- massimo sandal
- Ryan May
- Xavier Gnata