Hello,

I observed that there are 2 standard deviation functions in the scipy/numpy modules:

Numpy: http://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html#numpy.std
Scipy: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.std.html#sci...

What is the difference? There is no formula included within the docstrings. I suppose that np.std() is for the whole population and scipy.std is designed for a smaller sample of the population. Is that true?

Are there any functions for calculating the mean bias error (MBE)? I am looking for formula 3 in http://en.wikipedia.org/wiki/Mean_squared_error#Examples The function mbe seems to implement it here: http://cerea.enpc.fr/polyphemus/doc/atmopy/public/atmopy.stat.measure-module...

I also found an implementation of the root mean square error (RMSE), as the function rms, in: http://www2-pcmdi.llnl.gov/cdat/source/api-reference/genutil.statistics.html Unfortunately, cdat cannot be installed natively on Windows.

Thanks in advance, Timmie
On Wed, Dec 17, 2008 at 7:58 PM, Tim Michelsen <timmichelsen@gmx-topmail.de> wrote:
Hello, I observed that there are 2 standard deviation functions in the scipy/numpy modules:
Numpy: http://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html#numpy.std
Scipy: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.std.html#sci...
What is the difference? There is no formula included within the docstrings.
I suppose that np.std() is for the whole population and scipy.std is designed for a smaller sample in the population. Is that true?
The difference between the population (numpy) and sample (scipy.stats) variance and standard deviation is whether the estimator is biased, i.e. uses 1/n, or not, i.e. uses 1/(n-1). Look at the description in the source http://docs.scipy.org/scipy/source/scipy/dist/lib64/python2.4/site-packages/... for the deprecation warning. See also the distinction in your Wikipedia reference between biased and unbiased.
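To make the 1/n versus 1/(n-1) distinction concrete, here is a minimal numpy sketch; the sample data is made up purely for illustration, and numpy's `ddof` argument selects between the two normalizations:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
n = len(x)
m = x.mean()

pop_std = np.std(x)           # biased / "population" form: divides by n (ddof=0 is the default)
samp_std = np.std(x, ddof=1)  # "sample" form: divides by n - 1

# check against the explicit formulas
assert np.isclose(pop_std, np.sqrt(np.sum((x - m)**2) / n))
assert np.isclose(samp_std, np.sqrt(np.sum((x - m)**2) / (n - 1)))
```

The deprecated scipy.stats.std corresponds to the ddof=1 form, while numpy's default is ddof=0.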
Are there any functions for calculating the mean bias error (MBE)?
I am looking for forumla 3 in http://en.wikipedia.org/wiki/Mean_squared_error#Examples
I'm not sure what your use case is, but in the referenced formula, the MSE is the theoretical MSE of the estimator and is not calculated from the sample. Overall, these are one-liners in any matrix/array package.

For example, when I do a Monte Carlo for an estimator theta_hat, where the true parameter is theta (a scalar constant) and theta_hat is the array of estimators for the different runs, then the RMSE is just RMSE = np.sqrt(np.sum((theta_hat - theta)**2) / float(n)). For the first Wikipedia example, the MSE for observed Y_i compared to predicted Yhat_i is just MSE = np.sum((Y - Y_hat)**2) / float(n).

Josef
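The one-liners above, written out as a runnable sketch; the observed and predicted values here are made-up illustrative data, and the MBE line follows the sign convention of formula 3 in the Wikipedia reference (predicted minus observed):

```python
import numpy as np

y = np.array([2.0, 4.0, 6.0])      # observed values
y_hat = np.array([2.5, 3.5, 6.5])  # predicted values
n = len(y)

mse = np.sum((y - y_hat)**2) / float(n)  # mean squared error
rmse = np.sqrt(mse)                      # root mean squared error
mbe = np.sum(y_hat - y) / float(n)       # mean bias error: average signed deviation
```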
On Wed, Dec 17, 2008 at 19:53, <josef.pktd@gmail.com> wrote:
On Wed, Dec 17, 2008 at 7:58 PM, Tim Michelsen <timmichelsen@gmx-topmail.de> wrote:
Hello, I observed that there are 2 standard deviation functions in the scipy/numpy modules:
Numpy: http://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html#numpy.std
Scipy: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.std.html#sci...
What is the difference? There is no formula included within the docstrings.
I suppose that np.std() is for the whole population and scipy.std is designed for a smaller sample in the population. Is that true?
The difference between the population (numpy) and sample (scipy.stats) variance and standard deviation is whether the estimator is biased, i.e. uses 1/n, or not, i.e. uses 1/(n-1).
It's a shame that the "biased/unbiased" terminology still survives in the numpy.std() docstring. It's really quite wrong. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Wed, Dec 17, 2008 at 9:03 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Wed, Dec 17, 2008 at 19:53, <josef.pktd@gmail.com> wrote:
On Wed, Dec 17, 2008 at 7:58 PM, Tim Michelsen <timmichelsen@gmx-topmail.de> wrote:
Hello, I observed that there are 2 standard deviation functions in the scipy/numpy modules:
Numpy: http://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html#numpy.std
Scipy: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.std.html#sci...
What is the difference? There is no formula included within the docstrings.
I suppose that np.std() is for the whole population and scipy.std is designed for a smaller sample in the population. Is that true?
The difference between the population (numpy) and sample (scipy.stats) variance and standard deviation is whether the estimator is biased, i.e. uses 1/n, or not, i.e. uses 1/(n-1).
It's a shame that the "biased/unbiased" terminology still survives in the numpy.std() docstring. It's really quite wrong.
I find talking about biased versus unbiased estimators much clearer than the population-sample distinction. "Degrees of freedom" might be more descriptive, but its meaning, I guess, relies on knowing the (asymptotic) distribution of the estimator, which I always forget and have to look up. Josef
On Wed, Dec 17, 2008 at 20:32, <josef.pktd@gmail.com> wrote:
On Wed, Dec 17, 2008 at 9:03 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Wed, Dec 17, 2008 at 19:53, <josef.pktd@gmail.com> wrote:
On Wed, Dec 17, 2008 at 7:58 PM, Tim Michelsen <timmichelsen@gmx-topmail.de> wrote:
Hello, I observed that there are 2 standard deviation functions in the scipy/numpy modules:
Numpy: http://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html#numpy.std
Scipy: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.std.html#sci...
What is the difference? There is no formula included within the docstrings.
I suppose that np.std() is for the whole population and scipy.std is designed for a smaller sample in the population. Is that true?
The difference between the population (numpy) and sample (scipy.stats) variance and standard deviation is whether the estimator is biased, i.e. uses 1/n, or not, i.e. uses 1/(n-1).
It's a shame that the "biased/unbiased" terminology still survives in the numpy.std() docstring. It's really quite wrong.
I find talking about biased versus unbiased estimator much clearer than the population - sample distinction, and degrees of freedom might be more descriptive but its meaning, I guess, relies on knowing about the (asymptotic) distribution of the estimator, which I always forget and have to look up.
The problem is that the "unbiased" estimate for the standard deviation is *not* the square root of the "unbiased" estimate for the variance. The latter is what numpy.std(x, ddof=1) calculates, not the former.

This problem arises because of a pretty narrow concept of "error" that gets misapplied to the variance estimator. The usual "error" that gets used is the arithmetic difference between the estimate and the true value of the parameter (p_est - p_true). For parameters like means, this is usually fine, but for so-called scale parameters like variance, it's quite inappropriate. For example, the arithmetic error between a true value of 1.0 (in whatever units) and an estimate of 2.0 is the same as that between 101.0 and 102.0. When you drop a square root into that formula, you don't get the same answers out when you seek the estimator that sets the bias to 0.

Rather, a much more appropriate error measure for variance would be the log-ratio: log(p_est/p_true). That way, 1.0 and 2.0 would be the same distance from each other as 100.0 and 200.0. Using this measure of error to define bias, the unbiased estimate of the standard deviation actually is the square root of the unbiased estimate of the variance, too, thanks to the magic of logarithms. Unfortunately for those who want to call the (n-1) version "unbiased", the unbiased estimator (for normal distributions, at least) uses (n-2). Oops! Other distributions have different optimal denominators: heavier-tailed distributions tend towards (n-3), finite-support distributions tend towards (n-1).

But of course, bias is not the only thing to be concerned about. The bias is just the arithmetic average of the errors. If you want to minimize the total spread of the errors sum(abs(err)), too, that's another story. With the arithmetic error metric, the unbiased estimator of the variance uses (n-1) while the estimator with the smallest total error uses (n).
With the log-ratio error metric, the unbiased estimator is the same as the one that minimizes the total error. Happy days!

I also find the population/sample distinction to be bogus. IIRC, there are even some sources which switch the meanings around. In any case, the docstring should have a section saying, "If you are looking for what is called the "unbiased" or "sample" estimate of variance, use ddof=1." Those terms are widely, if incorrectly, used, so we should mention them. I just find it disheartening that the terms are used without qualification.
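The claim that the square root of the "unbiased" variance estimate is not an unbiased estimate of the standard deviation is easy to check with a small Monte Carlo; the sample size, seed, and repetition count below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, reps = 1.0, 5, 200_000

samples = rng.normal(0.0, sigma, size=(reps, n))
s = samples.std(axis=1, ddof=1)  # sqrt of the "unbiased" variance estimate

# ddof=1 makes the *variance* estimate unbiased, but the square root is a
# concave transformation, so E[s] < sigma by Jensen's inequality.
print(s.mean())  # well below 1.0 for n = 5 (theory says about 0.94)
```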
On Wed, Dec 17, 2008 at 10:11 PM, Robert Kern <robert.kern@gmail.com> wrote:
The problem is that the "unbiased" estimate for the standard deviation is *not* the square root of the "unbiased" estimate for the variance. The latter is what numpy.std(x, ddof=1) calculates, not the former. This problem arises because of a pretty narrow concept of "error" that gets misapplied to the variance estimator. The usual "error" that gets used is the arithmetic difference between the estimate and the true value of the parameter (p_est - p_true). For parameters like means, this is usually fine, but for so-called scale parameters like variance, it's quite inappropriate. For example, the arithmetic error between a true value of 1.0 (in whatever units) and an estimate of 2.0 is the same as that between 101.0 and 102.0. When you drop a square root into that formula, you don't get the same answers out when you seek the estimator that sets the bias to 0.
Old habits led me astray: in response to your previous email, I checked the docs for variance, not for standard deviation. I have never looked at the statistical properties of estimators of the standard deviation, only those of variance estimators. I learned the unbiased estimator as a contrast to the maximum likelihood estimator for the variance. So, thank you for the clarification; I guess it is the same story with the scale parameter estimation in distribution fitting.
Rather, a much more appropriate error measure for variance would be the log-ratio: log(p_est/p_true). That way, 1.0 and 2.0 would be the same distance from each other as 100.0 and 200.0. Using this measure of error to define bias, the unbiased estimate of the standard deviation actually is the square root of the unbiased estimate of the variance, too, thanks to the magic of logarithms. Unfortunately for those who want to call the (n-1) version "unbiased", the unbiased estimator (for normal distributions, at least) uses (n-2). Oops! Other distributions have different optimal denominators: heavier-tailed distributions tend towards (n-3), finite-support distributions tend towards (n-1).
But of course, bias is not the only thing to be concerned about. The bias is just the arithmetic average of the errors. If you want to minimize the total spread of the errors sum(abs(err)), too, that's another story. With the arithmetic error metric, the unbiased estimator of the variance uses (n-1) while the estimator with the smallest total error uses (n). With the log-ratio error metric, the unbiased estimator is the same as the one that minimizes the total error. Happy days!
Agreed, but this then leads to the discussion of what the appropriate loss function for the estimator is, which would go too far for a standard statistical presentation. Fortunately, for large samples, n versus n-1 or n-2 doesn't matter, and asymptotically we are all the same, at least in nice cases.
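The large-sample point can be illustrated directly: the gap between the 1/n and 1/(n-1) estimates is algebraically v0/(n-1), so it vanishes as n grows. A quick sketch (seed and sample sizes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
for n in (10, 1_000, 100_000):
    x = rng.normal(size=n)
    v0 = x.var()        # 1/n normalization
    v1 = x.var(ddof=1)  # 1/(n-1) normalization
    # v1 - v0 == v0 / (n - 1): the correction shrinks like 1/n
    print(n, v1 - v0)
```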
I also find the population/sample distinction to be bogus, too. IIRC, there are even some sources which switch around the meanings, too. In any case, the docstring should have a section saying, "If you are looking for what is called the "unbiased" or "sample" estimate of variance, use ddof=1." Those terms are widely, if incorrectly, used, so we should mention them. I just find it disheartening that the terms are used without qualification.
I compared the docstrings for np.var and np.std. np.std is very brief on biasedness, and I think the sentence in np.var is relatively neutral; neither docstring mentions the word "unbiased". Your reference to "unbiased" and "sample" estimators might be useful for users who are not so familiar with the statistical background. After your explanation, I would think it better to drop any reference to biasedness in np.std. On the other hand, the scipy docstrings still need a lot of cleaning and writing. Josef
josef.pktd@gmail.com wrote:
I compared the doc strings for np.var and np.std. np.std is very brief on biasedness, and I think the sentence in np.var is relatively neutral, neither doc string mentions the word unbiased. Your reference to "unbiased" and "sample" estimator might be useful for users that are not so familiar with the statistical background. After your explanation, I would think it would be better to drop any reference to biasedness in np.std. On the other hand, the scipy doc strings still need a lot of cleaning and writing.
Those functions are deprecated in 0.7, so this should not be too much of a concern. David
I also find the population/sample distinction to be bogus, too. IIRC, there are even some sources which switch around the meanings, too. In any case, the docstring should have a section saying, "If you are looking for what is called the "unbiased" or "sample" estimate of variance, use ddof=1." Those terms are widely, if incorrectly, used, so we should mention them. I just find it disheartening that the terms are used without qualification.
Hello, first let me state that I learned statistics in German and hoped to find the right translation term. For some terms, other languages may not even have a suitable translation.
Second, I would suggest including the actual formulas in such disputed docstrings. I have seen that, depending on the area of work, people tend to use correct scientific connotations or other terms (science vs. engineering). The sphinxext of matplotlib offers a very convenient way to include math formulas. These would make it clear. In Excel (or Calc), the functions have different names but the formulas remain. Thanks for the explanations. Kind regards, Timmie
On Thu, Dec 18, 2008 at 04:10, Timmie <timmichelsen@gmx-topmail.de> wrote:
Hello,
I also find the population/sample distinction to be bogus, too. IIRC, there are even some sources which switch around the meanings, too. In any case, the docstring should have a section saying, "If you are looking for what is called the "unbiased" or "sample" estimate of variance, use ddof=1." Those terms are widely, if incorrectly, used, so we should mention them. I just find it disheartening that the terms are used without qualification. First, let me state that I learned statistics in German and hoped to find the right translation term. For some terms, other languages may not even have a suitable translation.
The terms are commonly used in English the same way that you are using them. I just happen to disagree with the common practice.
Second, I would suggest to include the actual formulas in such disputed docstrings.
The development version does: http://docs.scipy.org/numpy/docs/numpy.core.fromnumeric.std/ (before I edited it just now, even).
Second, I would suggest to include the actual formulas in
such disputed docstrings.
The development version does:
http://docs.scipy.org/numpy/docs/numpy.core.fromnumeric.std/
(before I edited it just now, even). Yes, but they are not rendered as they are here: http://matplotlib.sourceforge.net/devel/documenting_mpl.html#formatting
What do you think?
Hi Robert 2008/12/18 Robert Kern <robert.kern@gmail.com>:
The terms are commonly used in English the same way that you are using them. I just happen to disagree with the common practice.
I am fond of your explanation on "bias" (I've read it a couple of times now). Would you consider writing it up somewhere? There are so many common misconceptions around this topic that such a document would be a valuable resource. I saw the term again earlier today in some of Stein's papers (Estimation with Quadratic Loss and Estimation of the Mean of a Multivariate Normal Distribution), and was reminded of this thread. Regards Stéfan
I find the new doc string for np.var a bit misleading: "The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead." This seems to refer to the divisor of the mean not of the variance. I need to sign up for doc editing rights. Josef
On Thu, Dec 18, 2008 at 09:43, <josef.pktd@gmail.com> wrote:
I find the new doc string for np.var a bit misleading:
"The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead."
This seems to refer to the divisor of the mean not of the variance.
Yes, the "mean" as used in the variance formula given in the prior sentence. I don't really like the phrasing much, either, but I inherited it from what was already there. -- Robert Kern "I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth." -- Umberto Eco
On Thu, Dec 18, 2008 at 3:36 PM, Robert Kern <robert.kern@gmail.com> wrote:
On Thu, Dec 18, 2008 at 09:43, <josef.pktd@gmail.com> wrote:
I find the new doc string for np.var a bit misleading:
"The mean is normally calculated as x.sum() / N, where N = len(x). If, however, ddof is specified, the divisor N - ddof is used instead."
This seems to refer to the divisor of the mean not of the variance.
Yes, the "mean" as used in the variance formula given in the prior sentence. I don't really like the phrasing much, either, but I inherited it from what was already there.
The problem is that "mean" shows up twice in the previous sentence. What about replacing mean in the sentence "The mean is normally calculated as x.sum() / N, where N = len(x)." by "average", which would unambiguously refer to the mean(average) of the squared deviations in the previous sentence. An english question: is divisor a synonym for denominator in common use? When I looked it up in the dictionary it didn't seem to be the case. Josef
On 12/18/2008 12:02 PM, Robert Kern wrote:
The terms are commonly used in English the same way that you are using them. I just happen to disagree with the common practice.
I agree with this. Also:

"The problem is that the "unbiased" estimate for the standard deviation is *not* the square root of the "unbiased" estimate for the variance. The latter is what numpy.std(x, ddof=1) calculates, not the former."

An unbiased variance estimate is what people usually want. But 9 out of 10 practitioners think they need an unbiased standard deviation, and they think they get it from normalizing by N-1. They do the "right thing" just because their Stat 101 texts tell them to, or because SPSS or MINITAB does it by default. Erroneous use of statistics due to mathematical incompetence is a major contribution to bad science.

Perhaps it is better if the docstring just specifies that ddof=1 normalizes by N-1, whereas ddof=0 normalizes by N?

Sturla Molden
On Fri, Dec 19, 2008 at 11:23, Sturla Molden <sturla@molden.no> wrote:
On 12/18/2008 12:02 PM, Robert Kern wrote:
The terms are commonly used in English the same way that you are using them. I just happen to disagree with the common practice.
I agree with this. Also:
"The problem is that the "unbiased" estimate for the standard deviation is *not* the square root of the "unbiased" estimate for the variance. The latter is what numpy.std(x, ddof=1) calculates, not the former."
An unbiased variance estimate is what people usually want. But 9 out of 10 practitioners think they need an unbiased standard deviation, and they think they get it from normalizing by N-1. They do the "right thing" just because their Stat 101 text tell them to, or because SPSS or MINITAB is doing it by default. Erroneous use of statistics due to mathematical incompetence is a major contribution to bad science.
Perhaps it is better if the docstring just specifies that ddof=1 normalizes by N-1, whereas ddof=0 normalizes by N?
How does the current version strike you? http://docs.scipy.org/numpy/docs/numpy.core.fromnumeric.std/ http://docs.scipy.org/numpy/docs/numpy.core.fromnumeric.var/
On 12/19/2008 6:51 PM, Robert Kern wrote:
How does the current version strike you?
http://docs.scipy.org/numpy/docs/numpy.core.fromnumeric.std/ http://docs.scipy.org/numpy/docs/numpy.core.fromnumeric.var/
It looks accurate. :) Also, it mentions that ddof=0 gives the ML estimate, which is often overlooked. A warning about what ddof=1 may or will do to the standard error of the variance would also be useful. Estimating the variance without bias can be equivalent to throwing away a substantial portion of the data, which in turn may translate to a lot of lost investment in work and money. Sturla Molden
On Fri, Dec 19, 2008 at 1:05 PM, Sturla Molden <sturla@molden.no> wrote:
On 12/19/2008 6:51 PM, Robert Kern wrote:
How does the current version strike you?
http://docs.scipy.org/numpy/docs/numpy.core.fromnumeric.std/ http://docs.scipy.org/numpy/docs/numpy.core.fromnumeric.var/
It looks accurate. :)
Also it mentions that ddof=0 gives the ML estimate, which is often overlooked.
A warning about what ddof=1 may/will do to the standard error of the variance would also be useful. Estimating the variance unbiased can be equivalent of throwing away a substantial portion of the data; which in turn may translate to a lot of lost investment in work and money.
Why would you be throwing away data if you use a different normalization? I think the only serious point about the degrees of freedom correction arises when using the distribution of the estimator, e.g. for testing, and there the ddof is given by the statistical theory. Whether an estimate for the variance or standard deviation in a report is normalized by N or N-1 doesn't really matter, given the randomness of the statistical problem; at least I never checked what normalization the author used. Josef
biased, i.e. 1/n, or not, i.e. 1/(n-1). Look at description in source
http://docs.scipy.org/scipy/source/scipy/dist/lib64/python2.4/site-packages/...
for the deprecation warning. Why do we not link scipy.std to numpy.std?
scipy.stats.std = numpy.std? Or are there problems with old code depending on it?
Timmie <timmichelsen <at> gmx-topmail.de> writes:
biased, i.e. 1/n, or not, i.e. 1/(n-1). Look at description in source
http://docs.scipy.org/scipy/source/scipy/dist/lib64/python2.4/site-packages/...
for the deprecation warning. Why do we not link scipy.std to numpy.std?
scipy.stats.std = numpy.std?
Or are there problems with old code depending on it? It seems there were earlier reports on this: http://thread.gmane.org/gmane.comp.python.scientific.user/13677/focus=13679
The question seems to be: which statistical functions shall be catered for in which library?
Hello, although there were many explanations, I am still trying to find the right correlation function. When I import my data into Excel or OpenOffice Calc and apply the correlation function (=CORREL) there, I get a totally different result (+0.9) than with np.correlate (5e+23). I am not sure whether a) I am still using the wrong function in numpy/scipy or b) the Excel calculations are wrong. But I am pretty sure that they are not, because OpenOffice Calc gives the same results. A hint is very welcome here. There was an earlier discussion here: Regression report like in Excel - http://article.gmane.org/gmane.comp.python.scientific.user/9537 But since the sandbox has gone... Thanks in advance, Timmie
On Thu, Dec 18, 2008 at 2:30 PM, Timmie <timmichelsen@gmx-topmail.de> wrote:
Hello, although there were many explanations, I am still trying to find the right correlation function.
When I import my data into Excel or OpenOffice Calc and apply the correlation function (=CORREL) there, I get a totally different result (+0.9) than with np.correlate (5e+23). I am not sure whether a) I am still using the wrong function in numpy/scipy or b) the Excel calculations are wrong. But I am pretty sure that they are not, because OpenOffice Calc gives the same results.
A hint is very welcome here.
There was an earlier discussion here:
Regression report like in Excel - http://article.gmane.org/gmane.comp.python.scientific.user/9537
But since the sandbox has gone...
Thanks in advance, Timmie
If you are looking for the correlation coefficient and covariance, then the function is numpy.corrcoef(x, y=None, rowvar=1, bias=0); np.correlate does not calculate a correlation coefficient. Currently there are no convenience wrappers for statistical models included in scipy, but they will come back and be included once they are cleaned up enough. For OLS, there is a good example in the scipy cookbook that calculates many regression diagnostics and prints a useful summary. Josef
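The distinction explains the huge number Timmie saw: np.correlate computes a raw cross-correlation (an inner product of the arrays), which grows with the scale of the data, while np.corrcoef returns the normalized Pearson coefficient. A small sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])  # roughly 2 * x

r = np.corrcoef(x, y)[0, 1]  # Pearson correlation coefficient, bounded in [-1, 1]
raw = np.correlate(x, y)     # just sum(x * y): scale-dependent, not a coefficient

print(r)       # close to 1.0
print(raw[0])  # 60.9 -- nothing like a correlation coefficient
```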
participants (7)

- David Cournapeau
- josef.pktd@gmail.com
- Robert Kern
- Sturla Molden
- Stéfan van der Walt
- Tim Michelsen
- Timmie