new Kolmogorov-Smirnov test
Since the old scipy.stats.kstest wasn't correct, I spent quite some time fixing and testing it. Now I know more about the Kolmogorov-Smirnov test than I wanted to.

The kstest now resembles the one in R and in matlab, giving the option for two-sided or one-sided tests. The names of the keyword options are a mixture of matlab and R, whichever I liked best.

Since the exact distribution of the two-sided test is not available in scipy, I use an approximation that seems to work very well. In several Monte Carlo studies against R, I get very close results, especially for small p-values. (For those interested: for small p-values, I use ksone.sf(D, n)*2; for large p-values or large n, I use the asymptotic distribution kstwobign.)

Example signature and options:

```
kstest(x, testdistfn.name, alternative='unequal', mode='approx')
kstest(x, testdistfn.name, alternative='unequal', mode='asymp')
kstest(x, testdistfn.name, alternative='larger')
kstest(x, testdistfn.name, alternative='smaller')
```

Below is the Monte Carlo for the case where the random variable and the hypothesized distribution are both standard normal (with sample size 100 and 1000 replications). Rejection rates are very close to the alpha levels. It also contains the mean absolute error (MAE) for the old kstest. I also checked mean-shifted normal random variables. In all cases that I tried, I get exactly the same rejection rates as in R.

For details see the doc string or source. I attach the file in a separate email, to get around the attachment size limit.

I intend to put this in trunk tomorrow; review and comments are welcome.
Josef

```
data generation distribution is norm, hypothesis is norm
==================================================
n = 100, loc = 0.000000 scale = 1.000000, n_repl = 1000
columns: D, pval
rows are
kstest(x, testdistfn.name, alternative='unequal', mode='approx')
kstest(x, testdistfn.name, alternative='unequal', mode='asymp')
kstest(x, testdistfn.name, alternative='larger')
kstest(x, testdistfn.name, alternative='smaller')

Results for comparison with R:

MAE old kstest
[[ 0.00453195  0.19152727]
 [ 0.00453195  0.2101139 ]
 [ 0.02002774  0.19145982]
 [ 0.02880553  0.26650226]]
MAE new kstest
[[ 1.87488913e-17  1.07738517e-02]
 [ 1.87488913e-17  1.91763848e-06]
 [ 2.38576520e-17  8.90287843e-16]
 [ 1.41312743e-17  9.92428362e-16]]
percent count absdev > 0.005
[[  0.  53.9]
 [  0.   0. ]
 [  0.   0. ]
 [  0.   0. ]]
percent count absdev > 0.01
[[  0.  24.3]
 [  0.   0. ]
 [  0.   0. ]
 [  0.   0. ]]
percent count abs percent dev > 1%
[[  0.  51.8]
 [  0.   0. ]
 [  0.   0. ]
 [  0.   0. ]]
percent count abs percent dev > 10%
[[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]]
new: proportion of rejection at 1% significance
[ 0.01   0.008  0.009  0.014]
R: proportion of rejection at 1% significance
[ 0.01   0.008  0.009  0.014]
new: proportion of rejection at 5% significance
[ 0.054  0.048  0.048  0.06 ]
R: proportion of rejection at 5% significance
[ 0.054  0.048  0.048  0.06 ]
new: proportion of rejection at 10% significance
[ 0.108  0.096  0.095  0.109]
R: proportion of rejection at 10% significance
[ 0.108  0.096  0.095  0.109]
```
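[Editor's note: for readers who want the mechanics, here is a minimal self-contained sketch of the one-sample test described above. This is not the scipy implementation; `kstest_sketch`, `normal_cdf`, and `kolmogorov_sf` are illustrative names, the series in `kolmogorov_sf` stands in for kstwobign.sf, and the mapping of 'larger'/'smaller' onto D+ and D- is an assumption.]

```python
import math

def normal_cdf(x, loc=0.0, scale=1.0):
    # Normal CDF via the error function (stands in for norm.cdf)
    return 0.5 * (1.0 + math.erf((x - loc) / (scale * math.sqrt(2.0))))

def kolmogorov_sf(t, terms=100):
    # Asymptotic two-sided KS survival function: 2 * sum (-1)^(k-1) exp(-2 k^2 t^2)
    if t <= 0:
        return 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t)
                  for k in range(1, terms + 1))
    return min(max(p, 0.0), 1.0)

def kstest_sketch(sample, cdf, alternative='unequal'):
    """One-sample KS test of `sample` against a continuous `cdf`.
    alternative: 'unequal' (two-sided), 'larger', or 'smaller'."""
    x = sorted(sample)
    n = len(x)
    probs = [cdf(v) for v in x]
    # D+ and D-: largest gaps of the empirical CDF above/below the theoretical one
    d_plus = max((i + 1) / n - p for i, p in enumerate(probs))
    d_minus = max(p - i / n for i, p in enumerate(probs))
    if alternative == 'unequal':
        d = max(d_plus, d_minus)
        return d, kolmogorov_sf(math.sqrt(n) * d)
    d = d_plus if alternative == 'larger' else d_minus
    # one-sided asymptotic tail probability: exp(-2 n d^2)
    return d, min(1.0, math.exp(-2.0 * n * d * d))
```

As a sanity check: if the sample sits exactly on the theoretical quantiles, D equals 1/(2n) and the two-sided p-value is near 1.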
attachment: new_kstest.py

Josef
Is there a correct two-sample k-s test in scipy?

On Sat, Nov 29, 2008 at 4:46 PM, <josef.pktd@gmail.com> wrote:

> Since the old scipy.stats.kstest wasn't correct, I spent quite some time fixing and testing it. [...]
On Wed, Dec 3, 2008 at 9:35 AM, Roban Hultman Kramer <roban@astro.columbia.edu> wrote:
Is there a correct two-sample k-s test in scipy?
http://www.scipy.org/scipy/scipy/changeset/5213

--
Jarrod Millman
Computational Infrastructure for Research Labs
10 Giannini Hall, UC Berkeley
phone: 510.643.4014
http://cirl.berkeley.edu/
On Wed, Dec 3, 2008 at 2:20 PM, Jarrod Millman <millman@berkeley.edu> wrote:
On Wed, Dec 3, 2008 at 9:35 AM, Roban Hultman Kramer <roban@astro.columbia.edu> wrote:
Is there a correct two-sample k-s test in scipy?
The kstest that I fixed is a one-sample test, comparing one sample with a theoretical distribution.

Josef
On Wed, Dec 3, 2008 at 12:35 PM, Roban Hultman Kramer <roban@astro.columbia.edu> wrote:
Is there a correct two-sample k-s test in scipy?
in scipy.stats.stats:

```python
def ks_2samp(data1, data2):
    """ Computes the Kolmogorov-Smirnof statistic on 2 samples.
    Modified from Numerical Recipies in C, page 493. Returns KS D-value, prob.
    Not ufunc-like.

    Returns: KS D-value, p-value
    """
```

It uses special.kolmogorov, which is the asymptotic two-sided distribution. It "looks" ok, but there are no tests for it, and I haven't tested it either. A quick Monte Carlo with your sample size would verify how accurate it is.

Josef
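[Editor's note: a minimal sketch of what a two-sample KS test of this kind does — the largest gap between two empirical CDFs, with a p-value from the plain asymptotic Kolmogorov distribution. This is an illustration, not scipy's code; `ks_2samp_sketch` and `kolmogorov_sf` are made-up names, and the plain asymptotic series stands in for special.kolmogorov.]

```python
import math
from bisect import bisect_right

def kolmogorov_sf(t, terms=100):
    # Asymptotic two-sided KS survival function: 2 * sum (-1)^(k-1) exp(-2 k^2 t^2)
    if t <= 0:
        return 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t)
                  for k in range(1, terms + 1))
    return min(max(p, 0.0), 1.0)

def ks_2samp_sketch(data1, data2):
    """Two-sample KS statistic D and asymptotic p-value."""
    x1, x2 = sorted(data1), sorted(data2)
    n1, n2 = len(x1), len(x2)
    d = 0.0
    # The two step-function ECDFs can only differ most at a jump point,
    # so it suffices to check every observed value from either sample.
    for v in x1 + x2:
        d = max(d, abs(bisect_right(x1, v) / n1 - bisect_right(x2, v) / n2))
    en = math.sqrt(n1 * n2 / (n1 + n2))  # effective sample size
    return d, kolmogorov_sf(en * d)
```

For two completely separated samples D is 1, and for identical samples D is 0 with a p-value of 1.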
Hi,
```python
def ks_2samp(data1, data2):
    """ Computes the Kolmogorov-Smirnof statistic on 2 samples.
    Modified from Numerical Recipies in C, page 493. Returns KS D-value, prob.
    Not ufunc-like.
    """
```
Wait - really? We can't use Numerical Recipes code, it has strict and incompatible licensing... If it's in there it really has to come out as fast as possible. Matthew
On Wed, Dec 3, 2008 at 11:43 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
> Wait - really? We can't use Numerical Recipes code, it has strict and incompatible licensing... If it's in there it really has to come out as fast as possible.

http://www.nr.com/licenses/redistribute.html
On Wed, Dec 3, 2008 at 2:49 PM, Jarrod Millman <millman@berkeley.edu> wrote:
> http://www.nr.com/licenses/redistribute.html
The algorithm is essentially one loop to calculate the distance measure; I would assume that this simple algorithm cannot be copyright protected, but for efficiency it might be better anyway to come up with a vectorized version similar to kstest.

about correctness:
=============
A quick Monte Carlo shows that the test is pretty accurate under the null even for small sample sizes; the power to reject when the alternative is true is only reasonably high in larger samples.

Null correct:

```
Monte Carlo for K-S 2sample test (ks_2samp): sample size = 100, 1000 replications
sample 1: normal distribution (loc=1.000000, scale=2.000000)
sample 2: normal distribution (loc=1.000000, scale=2.000000)
ks_2samp: proportion of rejection at 1% significance: 0.003
ks_2samp: proportion of rejection at 5% significance: 0.049
ks_2samp: proportion of rejection at 10% significance: 0.101
```

Null not true:

```
Monte Carlo for K-S 2sample test (ks_2samp): sample size = 500, 1000 replications
sample 1: normal distribution (loc=0.000000, scale=1.000000)
sample 2: t distribution (dof=10, loc=0.000000, scale=1.000000)
ks_2samp: proportion of rejection at 1% significance: 0.253
ks_2samp: proportion of rejection at 5% significance: 0.71
ks_2samp: proportion of rejection at 10% significance: 0.88
```

Josef
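[Editor's note: a Monte Carlo of this kind can be sketched as follows. This is illustrative only, not the script that produced the numbers above; all names are made up, the plain asymptotic p-value stands in for special.kolmogorov, and the replication count is kept small so it runs quickly.]

```python
import math
import random
from bisect import bisect_right

def ks_2samp_d_and_p(data1, data2):
    # Two-sample KS distance and plain asymptotic p-value
    x1, x2 = sorted(data1), sorted(data2)
    n1, n2 = len(x1), len(x2)
    d = max(abs(bisect_right(x1, v) / n1 - bisect_right(x2, v) / n2)
            for v in x1 + x2)
    t = math.sqrt(n1 * n2 / (n1 + n2)) * d
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * t * t)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

def rejection_rate(draw1, draw2, n=100, n_repl=500, alpha=0.05, seed=12345):
    # Proportion of replications in which H0 (same distribution) is rejected
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_repl):
        s1 = [draw1(rng) for _ in range(n)]
        s2 = [draw2(rng) for _ in range(n)]
        _, p = ks_2samp_d_and_p(s1, s2)
        rejections += p < alpha
    return rejections / n_repl

# Under the null both samples are N(1, 2); the rejection rate should be
# near (or, for the conservative asymptotic p-value, a bit below) alpha.
rate = rejection_rate(lambda r: r.gauss(1, 2), lambda r: r.gauss(1, 2))
```

Swapping the second generator for a shifted or heavier-tailed draw turns the same loop into the power study described above.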
participants (4)
- Jarrod Millman
- josef.pktd@gmail.com
- Matthew Brett
- Roban Hultman Kramer