numpy/scipy: error of correlation coefficient (clumpy data)

robert no-spam at no-spam-no-spam.invalid
Thu Nov 16 04:17:05 EST 2006


sturlamolden wrote:
> robert wrote:
> 
>> here the bootstrap test will likewise tell us that the confidence interval narrows down by a factor of ~sqrt(10) - just the same as if there were 10-fold more well-distributed "new" data. Thus this kind of error estimation has no reasonable basis for data which is not very good.
> 
> 
> The confidence intervals narrows when the amount of independent data
> increases. If you don't understand why, then you lack a basic
> understanding of statistics. Particularly, it is a fundamental
> assumption in most statistical models that the data samples are
> "IDENTICALLY AND INDEPENDENTLY DISTRIBUTED", often abbreviated "i.i.d."
> And it certainly is assumed in this case. If you tell the computer (or
> model) that you have i.i.d. data, it will assume it is i.i.d. data,
> even when it's not. The fundamental law of computer science also applies
> to statistics: shit in = shit out. If you nevertheless provide data
> that are not i.i.d., like you just did, you will simply obtain invalid
> results.
> 
> The confidence interval concerns uncertainty about the value of a
> population parameter, not about the spread of your data sample. If you
> collect more INDEPENDENT data, you know more about the population from
> which the data was sampled. The confidence interval has the property
> that it will contain the unknown "true correlation" 95% of the times it
> is generated. Thus if you take two samples WITH INDEPENDENT DATA from the
> same population, one small and one large, the large sample will
> generate a narrower confidence interval. Computer intensive methods
> like bootstrapping and asymptotic approximations derived analytically
> will behave similarly in this respect. However, if you are dumb enough
> to just provide duplications of your data, the computer is dumb enough
> to accept that they are obtained statistically independently. In
> statistical jargon this is called "pseudo-sampling", and is one of the
> most common fallacies among uneducated practitioners.

that duplication is just an extreme example to show my need: when I get the data, there can be an inherent filter/damping or some other means of clumping in the data which I don't know of beforehand. My model is basically linear (it's a preparation step for ranking valuable input data for a classification task, i.e. for data reduction); only the degree of clumping in the data is unknown. Thus the formula for r is ok, but that bare i.i.d. formula for the error "(1-r**2)/sqrt(n)" (or a bootstrap test, just the same) is blind to that.
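
To make the point concrete, here is a minimal sketch (plain numpy; the helper name is just mine) of that bare formula, and of how a 10-fold duplication of the very same sample shrinks the "error" by ~sqrt(10) although no new information is added:

import numpy as np

def r_and_iid_error(x, y):
    # Pearson r and the bare i.i.d. error estimate (1 - r**2) / sqrt(n)
    r = np.corrcoef(x, y)[0, 1]
    return r, (1.0 - r**2) / np.sqrt(len(x))

rng = np.random.RandomState(0)
x = rng.randn(100)
y = x + rng.randn(100)

print(r_and_iid_error(x, y))
# "pseudo-sampling": duplicate the identical sample 10 times
print(r_and_iid_error(np.tile(x, 10), np.tile(y, 10)))
# -> exactly the same r, but the error estimate shrinks by ~sqrt(10)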

> Statistical software doesn't prevent the practitioner from shooting
> himself in the leg; it actually makes it a lot easier. Anyone can paste
> data from Excel into SPSS and hit "ANOVA" in the menu. Whether the
> output makes any sense is a whole other story. One can duplicate each
> sample three or four times, and SPSS would be ignorant of that fact. It
> cannot guess that you are providing it with crappy data, and prevent
> you from screwing up your analysis. The same goes for NumPy code. The
> statistical formulas you type in Python have certain assumptions, and
> when they are violated the output is of no value. The more severe the
> violation, the less valuable is the output.
> 
>> The interesting task is probably this: to check for linear correlation, but somehow "weight the clumping of the data" in the error estimation.
> 
> If you have a pathological data sample, then you need to specify your
> knowledge in greater detail. Can you e.g. formulate a reasonable
> stochastic model for your data, fit the model parameters using the
> data, and then derive the correlation analytically?

no, it's too complex. Or it's just: additional clumping/fractality in the data.
Thus linear correlation is assumed, but the x,y data distribution may have "less than 2 dimensions". There is no better model.

Think of an example like this: a drunken (x,y) 2D walker is supposed to walk along a diagonal, but he makes frequent and unpredictable pauses/slow-motion phases. You get x,y coordinates at 1 per second. His speed and time pattern do not matter at all - you just want to know how well he keeps to his track.
(My application data is even worse/blackbox; there is not even such a "model".)
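
A toy version of such a walker, just to have clumped test data (the pause probability and noise level here are made up), might look like this:

import numpy as np

def drunken_walker(n=2000, p_pause=0.7, noise=0.5, seed=1):
    # the walker drifts along the diagonal, but in most seconds he pauses,
    # so consecutive (x, y) samples clump together
    rng = np.random.RandomState(seed)
    steps = (rng.rand(n) > p_pause).astype(float)   # 0 = pause, 1 = real step
    t = np.cumsum(steps)                            # progress along the diagonal
    x = t + noise * rng.randn(n)
    y = t + noise * rng.randn(n)
    return x, y

x, y = drunken_walker()
r = np.corrcoef(x, y)[0, 1]
print(r, (1.0 - r**2) / np.sqrt(len(x)))
# with clumped data like this, the bare i.i.d. error is not trustworthy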

> I am beginning to think your problem is ill defined because you lack a
> basic understanding of maths and statistics. For example, it seems you
> were confusing numerical error (rounding and truncation error) with
> statistical sampling error, you don't understand why standard errors
> decrease with sample size, you are testing with pathological data, you
> don't understand the difference between independent data and data
> duplications, etc. You really need to pick up a statistics textbook and
> do some reading, that's my advice.

I think I understand all this very well. It's not on this level. The problem also has nothing to do with rounding, sampling errors etc.
Of course the error ~1/sqrt(n) is the basic assumption - not something I don't know, but something I "complain" about :-)   (Thus I even guessed the "dumb" formula for r_err well before I saw it somewhere. This is all not the real question.)
Yet I need a way to _NOT_ just fall back on that ~1/sqrt(n) for the error when there is unknown clumping in the data. It has to be something smarter - say, an automatic non-i.i.d. computation of a reasonable confidence interval/error for the correlation - in the absence of a final/total model.
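
One direction that does not assume i.i.d. points could be a (moving) block bootstrap: resample contiguous blocks instead of single points, so the short-range clumping survives inside each replicate. A rough sketch (the block length is picked by hand here, which is of course its own tuning problem):

import numpy as np

def block_bootstrap_r(x, y, block_len=50, n_boot=1000, seed=2):
    # moving-block bootstrap of Pearson r for serially dependent (clumped) data
    rng = np.random.RandomState(seed)
    n = len(x)
    n_blocks = int(np.ceil(n / float(block_len)))
    rs = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.randint(0, n - block_len + 1, n_blocks)
        idx = np.concatenate([np.arange(s, s + block_len) for s in starts])[:n]
        rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    return np.percentile(rs, [2.5, 97.5])   # rough 95% interval

# lo, hi = block_bootstrap_r(x, y)   # e.g. with the walker data above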

That's not even an exceptional application. In most measured data created by iso-timestep sampling (thus not "pathological" so far?), the space of the two interesting variables may be walked "non-iso". Think of any time series data where most of the data is "boring"/redundant, because the flow of the experiment is such that interesting things happen only occasionally. In the absence of a full model for the "whole history", one could try to preprocess the x,y data by attaching a density weight, in order to make it "non-pathological" before feeding it into the formula for r, r_err. Yet this is expensive. Or one could think of computing a rough fractal dimension and decorating the error like fracconst * (1-r**2)/sqrt(n).

The (fast) formula I'm looking for - possibly it doesn't exist - should do this in a rush.
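
Just to make the density-weight idea concrete, a rough sketch of what I mean (the inverse-density weights via a crude 2D histogram and the Kish "effective n" are ad-hoc choices, not a textbook recipe):

import numpy as np

def weighted_r_and_error(x, y, bins=20):
    # down-weight clumped points by inverse local density (crude 2D histogram),
    # then use a weighted Pearson r and an "effective n" in the error formula
    H, xe, ye = np.histogram2d(x, y, bins=bins)
    ix = np.clip(np.digitize(x, xe) - 1, 0, bins - 1)
    iy = np.clip(np.digitize(y, ye) - 1, 0, bins - 1)
    w = 1.0 / H[ix, iy]                      # inverse local density
    w /= w.sum()
    mx, my = np.sum(w * x), np.sum(w * y)
    cov = np.sum(w * (x - mx) * (y - my))
    r = cov / np.sqrt(np.sum(w * (x - mx)**2) * np.sum(w * (y - my)**2))
    n_eff = 1.0 / np.sum(w**2)               # Kish effective sample size
    return r, (1.0 - r**2) / np.sqrt(n_eff)

# r_w, err_w = weighted_r_and_error(x, y)   # e.g. with the walker data above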


Robert


