new Kolmogorov-Smirnov 2-sample test not Numerical Recipes based
On Wed, Dec 3, 2008 at 3:15 PM, <josef.pktd@gmail.com> wrote:
On Wed, Dec 3, 2008 at 2:49 PM, Jarrod Millman <millman@berkeley.edu> wrote:
On Wed, Dec 3, 2008 at 11:43 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
def ks_2samp(data1, data2):
    """ Computes the Kolmogorov-Smirnof statistic on 2 samples.
    Modified from Numerical Recipies in C, page 493.
    Returns KS D-value, prob.  Not ufunc-like.
Wait - really? We can't use Numerical Recipes code, it has strict and incompatible licensing... If it's in there it really has to come out as fast as possible.
I did a 2-sample kstest based on the definition, using searchsorted; see the function below. Attached is a script file with standalone old and new versions and a Monte Carlo evaluation. I didn't do any speed comparison.

The function works only for one-dimensional data; the old one doesn't handle more dimensions either, although there is some initial setup for more than one dimension in the old version that I don't understand.

Main reference used (after googling):
http://math.ucsd.edu/~gptesler/283/kolmogorov_smirnov_05.pdf

If there is any interest, I can replace the existing implementation with the new function.

Josef

Conclusion: comparing the old implementation based on Numerical Recipes with the new implementation
* if the 2 samples have the same size, the difference in the KS statistic is less than 1e-14
* if the sample sizes differ:
  * in an n1=2, n2=3 example the new version is correct, the old version is wrong
  * Monte Carlo gives essentially identical results (for sample sizes between 100 and 1000, rejection rates for alpha = 1%, 5%, 10% sometimes differ in the 3rd decimal); a rough sketch of such a rejection-rate check is shown after the function below
>>> data1 = np.array([1.0, 2.0])
>>> data2 = np.array([1.0, 2.0, 3.0])
>>> ks_2samp_new(data1+0.01, data2)
(0.33333333333333337, 0.99062316386915694)
>>> ks_2samp(data1+0.01, data2)
(0.33333333333333331, 0.99062316386915694)
>>> ks_2samp_new(data1-0.01, data2)
(0.66666666666666674, 0.42490954988801982)
>>> ks_2samp(data1-0.01, data2)   # result not correct
(-0.5, 0.77962787254643151)
===============
import numpy as np
from scipy.special import kolmogorov as ksprob  # survival function of the asymptotic Kolmogorov distribution

def ks_2samp_new(data1, data2):
    """ Computes the Kolmogorov-Smirnov statistic on 2 samples.

    Returns: KS D-value, p-value
    """
    data1, data2 = map(np.asarray, (data1, data2))
    n1 = data1.shape[0]
    n2 = data2.shape[0]
    data1 = np.sort(data1)
    data2 = np.sort(data2)
    data_all = np.concatenate([data1, data2])
    # empirical CDFs of both samples, evaluated at all observed points
    cdf1 = np.searchsorted(data1, data_all, side='right') / (1.0 * n1)
    cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0 * n2)
    d = np.max(np.absolute(cdf1 - cdf2))  # Note: d is the absolute, not the signed, distance
    en = np.sqrt(n1 * n2 / float(n1 + n2))
    try:
        prob = ksprob((en + 0.12 + 0.11 / en) * d)
    except Exception:
        prob = 1.0
    return d, prob
====================
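For reference, here is a minimal sketch of the kind of Monte Carlo rejection-rate comparison described above. It is not Josef's attached script; the helper name mc_rejection_rates is made up for illustration, and it assumes the old ks_2samp and the new ks_2samp_new are both importable so they can be passed in as ks_func.

===============
import numpy as np

def mc_rejection_rates(ks_func, n1=100, n2=300, nrep=1000, seed=0):
    """Rejection rates at nominal levels when H0 (same distribution) is true."""
    rng = np.random.RandomState(seed)
    pvals = np.empty(nrep)
    for i in range(nrep):
        x = rng.standard_normal(n1)          # both samples from the same distribution
        y = rng.standard_normal(n2)
        pvals[i] = ks_func(x, y)[1]          # keep the p-value from (D, p-value)
    return [(alpha, (pvals < alpha).mean()) for alpha in (0.01, 0.05, 0.10)]

# e.g. compare: mc_rejection_rates(ks_2samp) vs mc_rejection_rates(ks_2samp_new)
====================

Under the null, both implementations should reject close to the nominal alpha, which is the comparison behind the "sometimes differ in the 3rd decimal" remark above.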
Hi,
I did a 2-sample kstest based on the definition, using searchsorted; see the function below.
[snip]
If there is any interest, I can replace the existing implementation with the new function
[snip]
* in an n1=2, n2=3 example the new version is correct, the old version is wrong
* Monte Carlo gives essentially identical results (for sample sizes between 100 and 1000, rejection rates for alpha = 1%, 5%, 10% sometimes differ in the 3rd decimal)
Given the more secure license compatibility, and given that you think the new version is correct where the old one is wrong, I think this should go in now, and into the 0.7 release. Any disagreement?

Matthew
Hi,

I forgot to ask: I took the speed-of-convergence correction for the asymptotic KS distribution from the existing implementation. With a quick look on the internet I didn't find any reference for the correction.

en = np.sqrt(n1*n2/float(n1+n2))   # same as in http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
prob = ksprob((en+0.12+0.11/en)*d)

I'm curious why it is not just the sqrt(n) analog, i.e.

prob = ksprob(en*d)

I tried it in the Monte Carlo, and ksprob(en*d) has slightly less power than ksprob((en+0.12+0.11/en)*d).

Josef
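As a quick numerical illustration (my own sketch, not from the thread; it assumes ksprob corresponds to scipy.special.kolmogorov, the survival function of the asymptotic Kolmogorov distribution, and n1, n2, d below are arbitrary illustrative values):

===============
import numpy as np
from scipy.special import kolmogorov as ksprob  # asymptotic KS survival function

# hypothetical sample sizes and observed statistic, for illustration only
n1, n2, d = 100, 120, 0.15
en = np.sqrt(n1 * n2 / float(n1 + n2))

print(ksprob(en * d))                        # plain sqrt(n)-style scaling
print(ksprob((en + 0.12 + 0.11 / en) * d))   # small-sample corrected scaling
====================

Since (en + 0.12 + 0.11/en) > en and ksprob is decreasing, the corrected formula always gives the smaller p-value, which is consistent with the slightly higher power Josef observed in the Monte Carlo.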
On Wed, Dec 3, 2008 at 20:58, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
I did a 2-sample kstest based on the definition, using searchsorted; see the function below.
[snip]
If there is any interest, I can replace the existing implementation with the new function
[snip]
* in an n1=2, n2=3 example the new version is correct, the old version is wrong
* Monte Carlo gives essentially identical results (for sample sizes between 100 and 1000, rejection rates for alpha = 1%, 5%, 10% sometimes differ in the 3rd decimal)
Given the more secure license compatibility, and given that you think the new version is correct where the old one is wrong, I think this should go in now, and into the 0.7 release. Any disagreement?
None from me.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth."
  -- Umberto Eco
On Thu, Dec 4, 2008 at 11:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
I did a 2-sample kstest based on the definition, using searchsorted; see the function below.
[snip]
If there is any interest, I can replace the existing implementation with the new function
[snip]
* in an n1=2, n2=3 example the new version is correct, the old version is wrong
* Monte Carlo gives essentially identical results (for sample sizes between 100 and 1000, rejection rates for alpha = 1%, 5%, 10% sometimes differ in the 3rd decimal)
Given the more secure license compatibility, and given that you think the new version is correct where the old one is wrong, I think this should go in now, and into the 0.7 release. Any disagreement?
Nope, this change is fine. I won't complain this time :)

David
participants (4)
- David Cournapeau
- josef.pktd@gmail.com
- Matthew Brett
- Robert Kern