[Python-ideas] Pre-PEP 2nd draft: adding a statistics module to Python
Steven D'Aprano
steve at pearwood.info
Fri Aug 9 08:43:23 CEST 2013
On 09/08/13 12:39, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
>
> > [2] But which ones? I know of at least four different definitions
> > of each [of skew and kurtosis], three of which are in common
> > use. Annoyingly, the differences are not documented.
>
> Do you mean "definitions" or "implementations"?
Yes :-)
Or possibly both. Skew and kurtosis are especially egregious examples of mathematicians being inconsistent in their terminology, notation and definitions, but as far as I have been able to determine, there are three common formulae for skewness (the standardised third moment about the mean):
Population skewness:
γ₁ = ∑((x-μ)/σ)³ ÷ n
This, at least, everyone agrees on.
Biased sample skewness uses the same formula as γ₁, substituting the sample mean a and the sample standard deviation sn for the population mean μ and standard deviation σ:
g₁ = ∑((x-a)/sn)³ ÷ n
= √n ∑(x-a)³ / (∑(x-a)²)**(3/2)
Note that's the *uncorrected* (n degrees of freedom) version of sample standard deviation, not the n-1 version.
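As a rough sketch of the two equivalent forms of g₁ given above (the function names are purely illustrative and not taken from any existing module), assuming the data is a non-empty sequence of real numbers:

```python
import math

def skewness_g1(data):
    """Biased sample skewness g1: sum(((x-a)/sn)**3) / n."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    a = math.fsum(data) / n                      # sample mean
    ss2 = math.fsum((x - a) ** 2 for x in data)  # sum of squared deviations
    if ss2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    sn = math.sqrt(ss2 / n)                      # uncorrected (divide-by-n) std dev
    return math.fsum(((x - a) / sn) ** 3 for x in data) / n

def skewness_g1_expanded(data):
    """The same quantity via sqrt(n) * sum((x-a)**3) / sum((x-a)**2)**(3/2)."""
    n = len(data)
    a = math.fsum(data) / n
    ss2 = math.fsum((x - a) ** 2 for x in data)
    ss3 = math.fsum((x - a) ** 3 for x in data)
    if ss2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return math.sqrt(n) * ss3 / ss2 ** 1.5

data = [1, 2, 3, 4, 10]
print(skewness_g1(data), skewness_g1_expanded(data))
```

The two functions should agree up to floating point rounding; the second avoids computing the standard deviation explicitly.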
There are at least three bias-corrected formulae for sample skewness. This is the version of skewness used by SAS, SPSS, Excel, LibreOffice:
G₁ = √(n(n-1))÷(n-2) × g₁
And this is the version of skewness used by MINITAB:
b₁ = ((n-1)/n)**(3/2) × g₁
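Both corrections are just a multiplier applied to g₁ that depends only on n. A minimal sketch, with hypothetical function names, taking an already-computed g₁ and the sample size:

```python
import math

def skewness_G1(g1, n):
    """G1 = sqrt(n(n-1)) / (n-2) * g1 (the SAS/SPSS/Excel/LibreOffice form above)."""
    if n < 3:
        raise ValueError("G1 needs at least three data points")
    return math.sqrt(n * (n - 1)) / (n - 2) * g1

def skewness_b1(g1, n):
    """b1 = ((n-1)/n)**(3/2) * g1 (the MINITAB form above)."""
    if n < 1:
        raise ValueError("n must be positive")
    return ((n - 1) / n) ** 1.5 * g1

g1, n = 1.0, 5
print(skewness_b1(g1, n), g1, skewness_G1(g1, n))
```

For any n ≥ 3 the first multiplier is greater than 1 and the second is less than 1, so the two "corrections" move g₁ in opposite directions, and both multipliers approach 1 as n grows.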
There's a third I read about in a paper, but it doesn't appear to have been used anywhere. The paper's authors claim it has better properties than either of the above two.
Annoyingly, the notation g₁, G₁ and b₁ is sometimes used interchangeably, and I recall seeing somebody using K₁ or k₁ for one of the above (but I forget which one), but as near as I can determine, the above are the most common notations.
Kurtosis is much the same. There are two definitions for population kurtosis:
Pearson's kurtosis, or kurtosis proper:
β₂ = ∑((x-μ)/σ)⁴ ÷ n
Fisher's kurtosis, or excess kurtosis:
γ₂ = β₂ - 3
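The two population definitions differ only by the constant 3, which is the Pearson kurtosis of a normal distribution, so the excess form is zero for normal data. A short sketch (function names are illustrative only), treating the input as the entire population:

```python
import math

def pearson_kurtosis(population):
    """beta2 = sum(((x-mu)/sigma)**4) / n, i.e. kurtosis proper."""
    n = len(population)
    if n == 0:
        raise ValueError("need at least one data point")
    mu = math.fsum(population) / n                             # population mean
    var = math.fsum((x - mu) ** 2 for x in population) / n     # sigma squared
    if var == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    m4 = math.fsum((x - mu) ** 4 for x in population) / n      # fourth central moment
    return m4 / var ** 2

def excess_kurtosis(population):
    """gamma2 = beta2 - 3, i.e. Fisher's (excess) kurtosis."""
    return pearson_kurtosis(population) - 3

print(pearson_kurtosis([1, 2, 3, 4, 10]), excess_kurtosis([1, 2, 3, 4, 10]))
```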
Sample kurtosis g₂ is just γ₂ using sample mean and standard deviation. Again, there are at least three bias-corrected versions of the sample kurtosis. Here is the version used by SAS, SPSS, Excel, LibreOffice:
G₂ = (n-1)(g₂(n+1) + 6) ÷ ((n-2)(n-3))
= n(n+1)/((n-1)(n-2)(n-3)) × ∑(x-a)⁴ / s⁴ - 3(n-1)²/((n-2)(n-3))
where s is the corrected (n-1 degrees of freedom) sample standard deviation.
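A quick way to convince yourself that the two ways of writing G₂ agree is to compute both. This is only a sketch with made-up function names; in the expanded form the sum of fourth powers is divided by the fourth power of s, where s is the n-1 sample standard deviation:

```python
import math

def _deviation_sums(data):
    """Return (n, sum of squared deviations, sum of fourth-power deviations)."""
    n = len(data)
    if n < 4:
        raise ValueError("G2 needs at least four data points")
    a = math.fsum(data) / n                      # sample mean
    ss2 = math.fsum((x - a) ** 2 for x in data)
    ss4 = math.fsum((x - a) ** 4 for x in data)
    if ss2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return n, ss2, ss4

def kurtosis_G2(data):
    """G2 = (n-1)(g2(n+1) + 6) / ((n-2)(n-3)), with g2 the biased excess kurtosis."""
    n, ss2, ss4 = _deviation_sums(data)
    g2 = n * ss4 / ss2 ** 2 - 3                  # (ss4/n) / (ss2/n)**2 - 3
    return (n - 1) * (g2 * (n + 1) + 6) / ((n - 2) * (n - 3))

def kurtosis_G2_expanded(data):
    """The same estimate written in terms of the n-1 variance s**2."""
    n, ss2, ss4 = _deviation_sums(data)
    s2 = ss2 / (n - 1)                           # corrected sample variance
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return lead * ss4 / s2 ** 2 - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

data = [1, 2, 3, 4, 10]
print(kurtosis_G2(data), kurtosis_G2_expanded(data))
```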
And the version from MINITAB:
b₂ = ((n-1)/n)² × (g₂ + 3) - 3
plus another version in the paper I mentioned above. And, as with skewness, people are inconsistent with notation, only sometimes worse, because they don't always distinguish between the "excess" and "proper" kurtosis.
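To make the excess-versus-proper pitfall concrete, here is a sketch of b₂ computed straight from the data. It assumes, and this is my reading rather than anything taken from MINITAB's documentation, that b₂ is the fourth sample moment divided by the fourth power of the n-1 standard deviation, minus 3. Since g₂ is an excess kurtosis, the proper kurtosis g₂ + 3 is what gets rescaled before the 3 is subtracted again. Function names are illustrative only:

```python
import math

def _sums(data):
    """Return (n, sum of squared deviations, sum of fourth-power deviations)."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    a = math.fsum(data) / n                      # sample mean
    ss2 = math.fsum((x - a) ** 2 for x in data)
    ss4 = math.fsum((x - a) ** 4 for x in data)
    if ss2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return n, ss2, ss4

def kurtosis_g2(data):
    """Biased sample excess kurtosis: m4 / m2**2 - 3 with divide-by-n moments."""
    n, ss2, ss4 = _sums(data)
    return n * ss4 / ss2 ** 2 - 3

def kurtosis_b2(data):
    """Assumed definition: b2 = m4 / s**4 - 3, with s the n-1 standard deviation."""
    n, ss2, ss4 = _sums(data)
    s2 = ss2 / (n - 1)                           # corrected sample variance
    return (ss4 / n) / s2 ** 2 - 3

data = [1, 2, 3, 4, 10]
n = len(data)
rescaled = ((n - 1) / n) ** 2 * (kurtosis_g2(data) + 3) - 3
print(kurtosis_b2(data), rescaled)
```

Forgetting the + 3, that is, rescaling the excess kurtosis instead of the proper kurtosis, gives a different number, which is exactly the kind of confusion that arises when the two are not clearly distinguished.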
--
Steven