[Python-ideas] Pre-PEP 2nd draft: adding a statistics module to Python

Fri Aug 9 08:43:23 CEST 2013

On 09/08/13 12:39, Stephen J. Turnbull wrote:
> Steven D'Aprano writes:
>
>   > [2] But which ones? I know of at least four different definitions
>   > of each [of skew and kurtosis], three of which are in common
>   > use. Annoyingly, the differences are not documented.
>
> Do you mean "definitions" or "implementations"?

Yes :-)

Or possibly both. Skew and kurtosis are especially egregious examples of mathematicians being inconsistent in their terminology, notation and definitions. As far as I have been able to determine, there are three common formulae for skewness, all based on the third moment about the mean:

Population skewness:
     γ₁ = ∑((x-μ)/σ)³ ÷ n

This, at least, everyone agrees on.

Biased sample skewness uses the same formula as γ₁, substituting the sample mean a and the sample standard deviation sn for the population mean μ and standard deviation σ:

     g₁ = ∑((x-a)/sn)³ ÷ n
        = √n ∑(x-a)³ / (∑(x-a)²)**(3/2)

Note that's the *uncorrected* (n degrees of freedom) version of sample standard deviation, not the n-1 version.
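
To make the two forms of g₁ concrete, here is a minimal sketch in Python. The function names are made up for illustration and are not part of any proposed API; the first function follows the definition directly and the second uses the rearranged sum-of-powers form, so the two should agree up to rounding.

```python
import math

def skew_g1_definition(data):
    """g1 straight from the definition: mean of cubed standardised deviations."""
    n = len(data)
    a = math.fsum(data) / n                                   # sample mean
    sn = math.sqrt(math.fsum((x - a) ** 2 for x in data) / n)  # n-denominator std dev
    if sn == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return math.fsum(((x - a) / sn) ** 3 for x in data) / n

def skew_g1(data):
    """g1 in the rearranged form: sqrt(n) * sum(d**3) / sum(d**2)**1.5."""
    n = len(data)
    a = math.fsum(data) / n
    ss2 = math.fsum((x - a) ** 2 for x in data)  # sum of squared deviations
    ss3 = math.fsum((x - a) ** 3 for x in data)  # sum of cubed deviations
    if ss2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return math.sqrt(n) * ss3 / ss2 ** 1.5

data = [1, 2, 3, 4, 10]   # arbitrary right-skewed example data
print(skew_g1_definition(data), skew_g1(data))
```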

There are at least three bias-corrected formulae for sample skewness. This is the version used by SAS, SPSS, Excel and LibreOffice:

     G₁ = √(n(n-1))÷(n-2) × g₁

And this is the version of skewness used by MINITAB:

     b₁ = ((n-1)/n)**(3/2) × g₁

There's a third I read about in a paper, but it doesn't appear to have been used anywhere. The paper's authors claim it has better properties than either of the above two.
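
As a rough illustration of how the two bias corrections given above relate to g₁, here is a sketch with hypothetical function names. Both corrections are just multipliers on g₁ that depend only on n: the G₁ factor is always greater than 1 and the b₁ factor always less than 1, so for the same data |b₁| < |g₁| < |G₁|.

```python
import math

def skew_g1(data):
    """Biased sample skewness (n-denominator standard deviation)."""
    n = len(data)
    a = math.fsum(data) / n
    ss2 = math.fsum((x - a) ** 2 for x in data)
    ss3 = math.fsum((x - a) ** 3 for x in data)
    if ss2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return math.sqrt(n) * ss3 / ss2 ** 1.5

def skew_G1(data):
    """The SAS/SPSS/Excel-style correction: sqrt(n(n-1))/(n-2) * g1."""
    n = len(data)
    if n < 3:
        raise ValueError("G1 needs at least three data points")
    return math.sqrt(n * (n - 1)) / (n - 2) * skew_g1(data)

def skew_b1(data):
    """The MINITAB-style correction: ((n-1)/n)**1.5 * g1."""
    n = len(data)
    return ((n - 1) / n) ** 1.5 * skew_g1(data)

data = [1, 2, 3, 4, 10]   # arbitrary right-skewed example data
print(skew_b1(data), skew_g1(data), skew_G1(data))
```

The multipliers both tend to 1 as n grows, so the choice of formula matters mostly for small samples.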

Annoyingly, the notations g₁, G₁ and b₁ are sometimes used interchangeably. I also recall seeing somebody use K₁ or k₁ for one of the above, but I forget which one. As near as I can determine, though, the above are the most common notations.

Kurtosis is much the same. There are two definitions for population kurtosis:

Pearson's kurtosis, or kurtosis proper:
     β₂ = ∑((x-μ)/σ)⁴ ÷ n

Fisher's kurtosis, or excess kurtosis:
     γ₂ = β₂ - 3
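
A small sketch of the two population definitions, treating the data as the entire population (the function names are hypothetical). The only difference is the constant 3, which is the Pearson kurtosis of a normal distribution, so the excess kurtosis of a normal distribution is zero.

```python
import math

def pearson_kurtosis(population):
    """Kurtosis proper: mean of fourth powers of standardised deviations."""
    n = len(population)
    mu = math.fsum(population) / n
    sigma = math.sqrt(math.fsum((x - mu) ** 2 for x in population) / n)
    if sigma == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return math.fsum(((x - mu) / sigma) ** 4 for x in population) / n

def excess_kurtosis(population):
    """Fisher's (excess) kurtosis: Pearson's kurtosis minus 3."""
    return pearson_kurtosis(population) - 3

population = [1, 2, 3, 4, 10]   # arbitrary example data
print(pearson_kurtosis(population), excess_kurtosis(population))
```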

Sample kurtosis g₂ is just γ₂ using the sample mean and the uncorrected sample standard deviation. Again, there are at least three bias-corrected versions of the sample kurtosis. Here is the version used by SAS, SPSS, Excel and LibreOffice:

     G₂ = (n-1)(g₂(n+1) + 6) ÷ ((n-2)(n-3))
        = n(n+1)/((n-1)(n-2)(n-3)) × ∑(x-a)⁴ / s⁴ - 3(n-1)²/((n-2)(n-3))

where s is the corrected (n-1 degrees of freedom) sample standard deviation.

And the version from MINITAB:

     b₂ = ((n-1)/n)² × (g₂ + 3) - 3

plus another version in the paper I mentioned above. As with skewness, people are inconsistent with notation, only sometimes it is worse, because they don't always say whether they mean the "excess" or the "proper" kurtosis.
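
Here is a sketch of the sample kurtosis variants, useful mainly as a numerical cross-check. The function names are hypothetical, and it rests on two assumptions worth stating loudly: g₂ is the *excess* sample kurtosis (n-denominator moments, minus 3), and the MINITAB-style b₂ is taken to be m₄/s⁴ - 3, where s is the n-1 standard deviation, which works out to ((n-1)/n)² × (g₂ + 3) - 3. The expanded form of G₂ likewise divides by s⁴, the square of the n-1 variance.

```python
import math

def _sums(data):
    """Return n and the sums of squared and fourth-power deviations."""
    n = len(data)
    a = math.fsum(data) / n
    ss2 = math.fsum((x - a) ** 2 for x in data)
    ss4 = math.fsum((x - a) ** 4 for x in data)
    if ss2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    return n, ss2, ss4

def kurt_g2(data):
    """Biased sample excess kurtosis: m4 / m2**2 - 3 with n-denominator moments."""
    n, ss2, ss4 = _sums(data)
    return n * ss4 / ss2 ** 2 - 3

def kurt_G2(data):
    """SAS/SPSS/Excel-style correction, written in terms of g2."""
    n = len(data)
    if n < 4:
        raise ValueError("G2 needs at least four data points")
    return (n - 1) * ((n + 1) * kurt_g2(data) + 6) / ((n - 2) * (n - 3))

def kurt_G2_expanded(data):
    """The same quantity in expanded form (assumes s is the n-1 std dev)."""
    n, ss2, ss4 = _sums(data)
    if n < 4:
        raise ValueError("G2 needs at least four data points")
    s2 = ss2 / (n - 1)   # n-1 sample variance, so s**4 == s2**2
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return lead * ss4 / s2 ** 2 - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

def kurt_b2(data):
    """MINITAB-style correction (assumed to be m4 / s**4 - 3)."""
    n = len(data)
    return ((n - 1) / n) ** 2 * (kurt_g2(data) + 3) - 3

data = [1, 2, 3, 4, 10]   # arbitrary example data
print(kurt_g2(data), kurt_G2(data), kurt_G2_expanded(data), kurt_b2(data))
```

The two G₂ functions should print the same value up to rounding, which is a handy sanity check when transcribing these formulae from different sources.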

-- 
Steven