[SciPy-User] ks_2samp is not giving the same results as ks.test in R
josef.pktd at gmail.com
Fri Nov 2 11:41:46 EDT 2012
On Thu, Nov 1, 2012 at 11:27 PM, <josef.pktd at gmail.com> wrote:
> On Thu, Nov 1, 2012 at 9:42 PM, <josef.pktd at gmail.com> wrote:
>> On Thu, Nov 1, 2012 at 9:41 PM, <josef.pktd at gmail.com> wrote:
>>> On Thu, Nov 1, 2012 at 9:14 PM, <josef.pktd at gmail.com> wrote:
>>>> On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <pengyu.ut at gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> The ks_2samp function does not give the same answer as ks.test in R.
>>>>> Does anybody know why they are different? Does ks_2samp compute
>>>>> something different?
>>>>>
>>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R
>>>>>> ks.test(1:5, 11:15)
>>>>>
>>>>> Two-sample Kolmogorov-Smirnov test
>>>>>
>>>>> data: 1:5 and 11:15
>>>>> D = 1, p-value = 0.007937
>>>>> alternative hypothesis: two-sided
>>>>>
>>>>>> ks.test(1:5, 11:15, alternative='less')
>>>>>
>>>>> Two-sample Kolmogorov-Smirnov test
>>>>>
>>>>> data: 1:5 and 11:15
>>>>> D^- = 0, p-value = 1
>>>>> alternative hypothesis: the CDF of x lies below that of y
>>>>>
>>>>>> ks.test(1:5, 11:15, alternative='greater')
>>>>>
>>>>> Two-sample Kolmogorov-Smirnov test
>>>>>
>>>>> data: 1:5 and 11:15
>>>>> D^+ = 1, p-value = 0.006738
>>>>> alternative hypothesis: the CDF of x lies above that of y
>>>>>
>>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py
>>>>> (1.0, 0.0037813540593701006)
>>>>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py
>>>>> #!/usr/bin/env python
>>>>>
>>>>> from scipy.stats import ks_2samp
>>>>> print ks_2samp([1,2,3,4,5], [11,12,13,14,15])
>>>>
>>>> R uses by default an "exact" distribution for small samples if there
>>>> are no ties.
>>>> If there are ties or with a large sample, R uses the asymptotic distribution.
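>>>>
>>>> For the tiny example above, R's exact p-value can be reproduced by
>>>> brute-force enumeration of all splits of the pooled sample (a sketch,
>>>> not scipy or R code; only valid without ties):
>>>>
>>>> import itertools
>>>> import numpy as np
>>>> from scipy import stats
>>>>
>>>> x, y = np.arange(1., 6), np.arange(11., 16)
>>>> d_obs, _ = stats.ks_2samp(x, y)
>>>>
>>>> # under H0 every assignment of the pooled values to the two samples
>>>> # is equally likely, so count the choose(10, 5) = 252 splits whose
>>>> # D is at least as large as the observed one
>>>> pooled = np.concatenate((x, y))
>>>> n, N = len(x), len(pooled)
>>>> hits = total = 0
>>>> for idx in itertools.combinations(range(N), n):
>>>>     mask = np.zeros(N, bool)
>>>>     mask[list(idx)] = True
>>>>     d, _ = stats.ks_2samp(pooled[mask], pooled[~mask])
>>>>     hits += (d >= d_obs - 1e-12)
>>>>     total += 1
>>>> print hits, total, hits / float(total)
>>>> # 2 252 0.00793650...   (R's exact p-value: 0.007937)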
>>>>
>>>> If I read the function correctly, then scipy.stats is using a small
>>>> sample approximation by Stephens. (But I would have to look up the
>>>> formula to verify this.)
>>>
>>> http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test
>>> has the weighted sample size en = np.sqrt(n1*n2/float(n1+n2)).
>>> The small-sample weighting (en + 0.12 + 0.11/en)*d is the same as in
>>> Stephens (1970, 1985?) for the one-sample test.
>>> I don't have a reference for the two-sample approximation right now.
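>>>
>>> As a sketch of my reading of the scipy source (an assumption, verify
>>> against the code): the p-value comes from the Kolmogorov limiting
>>> distribution evaluated at the reweighted statistic.
>>>
>>> import numpy as np
>>> from scipy.stats import distributions
>>>
>>> def ks_2samp_pval_approx(d, n1, n2):
>>>     # Stephens-style small-sample correction, then the survival
>>>     # function of the Kolmogorov-Smirnov limiting distribution
>>>     en = np.sqrt(n1 * n2 / float(n1 + n2))
>>>     return distributions.kstwobign.sf((en + 0.12 + 0.11 / en) * d)
>>>
>>> print ks_2samp_pval_approx(0.36, 25, 21)
>>> # 0.0789934...  the same as stats.ks_2samp in the example below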
>>>
>>> (another bit of random information)
>>> tables are often only available for 0.001 to 0.25, and approximations
>>> are targeted on that range and might not be as accurate outside of it
>>>
>>> Josef
>>>
>>>
>>>>
>>>> In the example below, with a somewhat larger sample and no ties, our
>>>> approximation is closer to R's "exact" p-value than to R's asymptotic
>>>> p-value with exact=FALSE.
>>>>
>>>>> ks.test(1:25, (10:30)-0.5, exact=FALSE)
>>>>
>>>> Two-sample Kolmogorov-Smirnov test
>>>>
>>>> data: 1:25 and (10:30) - 0.5
>>>> D = 0.36, p-value = 0.1038
>>>> alternative hypothesis: two-sided
>>>>
>>>>> ks.test(1:25, (10:30)-0.5, exact=TRUE)
>>>>
>>>> Two-sample Kolmogorov-Smirnov test
>>>>
>>>> data: 1:25 and (10:30) - 0.5
>>>> D = 0.36, p-value = 0.07608
>>>> alternative hypothesis: two-sided
>>>>
>>>>
>>>>>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5)
>>>> (0.35999999999999999, 0.078993426961291274)
>>>>
>>>>
>>>> For the 1 sample kstest I used (when I rewrote stats.kstest) an
>>>> approximation that is closer to the exact distribution than the
>>>> asymptotic distribution, but it's also not exact.
>>>>
>>>> It would be good to have better small sample approximations or exact
>>>> distributions, but I worked on this in scipy.stats when I barely had
>>>> any idea about goodness-of-fit tests.
>>>> Also, ks_2samp never got the enhancement for one-sided alternatives.
>>>> (In statsmodels I have been working so far only on one sample tests,
>>>> but not on two-sample tests.)
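>>>>
>>>> A one-sided two-sample version is not hard to sketch. The p-value
>>>> below uses the Smirnov asymptotic tail exp(-2*en**2*d**2) (my choice
>>>> for this sketch, not existing scipy code); it reproduces R's
>>>> one-sided asymptotic value in the first example:
>>>>
>>>> import numpy as np
>>>>
>>>> def ks_2samp_greater(x, y):
>>>>     # D+ = max_t (ECDF_x(t) - ECDF_y(t)), evaluated at the pooled
>>>>     # sample points
>>>>     x, y = np.sort(x), np.sort(y)
>>>>     n1, n2 = len(x), len(y)
>>>>     pooled = np.concatenate((x, y))
>>>>     cdf_x = np.searchsorted(x, pooled, side='right') / float(n1)
>>>>     cdf_y = np.searchsorted(y, pooled, side='right') / float(n2)
>>>>     d_plus = (cdf_x - cdf_y).max()
>>>>     en2 = n1 * n2 / float(n1 + n2)
>>>>     return d_plus, np.exp(-2 * en2 * d_plus**2)
>>>>
>>>> print ks_2samp_greater(np.arange(1., 6), np.arange(11., 16))
>>>> # (1.0, 0.00673794...)   R: D^+ = 1, p-value = 0.006738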
>>>>
>>>> (I don't remember if there is a minimum size recommendation, but the
>>>> examples I usually checked were larger.)
>>>
>>> matlab help: http://www.mathworks.com/help/stats/kstest2.html
>>> "The asymptotic p value becomes very accurate for large sample sizes,
>>> and is believed to be reasonably accurate for sample sizes n1 and n2
>>> such that (n1*n2)/(n1 + n2) >= 4."
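>>>
>>> By that rule of thumb, the 5-vs-5 example is below the recommended
>>> range, while the 25-vs-21 example is well inside it:
>>>
>>> for n1, n2 in [(5, 5), (25, 21)]:
>>>     print n1, n2, n1 * n2 / float(n1 + n2)
>>> # 5 5 2.5        -> below 4
>>> # 25 21 11.413   -> asymptotics should be reasonable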
>>>>
>>>>
>>>> Since it's a community project: Pull Requests are welcome.
>
> (since I haven't spammed the mailing list in a while, and it's fast)
>
> I think, except in very small samples, ks_2samp looks ok.
>
> (However, independent of our p-values, the power of Kolmogorov-Smirnov
> in small samples is awful:
>>>> stats.ttest_ind(np.arange(1.,26), np.arange(10,31.)-0.5)
> (-3.2015129142545442, 0.0025398589272691129)  # reject Null
>>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5)
> (0.35999999999999999, 0.078993426961291274)  # don't reject Null at 0.05
> )
Anderson-Darling has higher power than Kolmogorov-Smirnov in most cases.
On my wishlist for a long time:
"K-sample Anderson–Darling tests"
http://scholar.google.com/scholar?cluster=1766356837451056669&hl=en&as_sdt=0,5&sciodt=0,5
volunteers? (a rough starting point is sketched below)
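
My sketch from the formula in Scholz & Stephens (1987), continuous case
without ties, with a permutation p-value instead of their tabulated
critical points. I have not checked it against the published tables,
so treat it as an assumption:

import numpy as np

def anderson_ksamp_stat(samples):
    # A2 = (1/N) * sum_i (1/n_i) * sum_{j=1}^{N-1}
    #      (N*M_ij - j*n_i)**2 / (j*(N-j))
    # where M_ij counts how many observations of sample i are among
    # the j smallest of the pooled sample
    pooled = np.sort(np.concatenate(samples))
    N = len(pooled)
    j = np.arange(1, N)   # j = 1, ..., N-1
    a2 = 0.0
    for x in samples:
        mij = np.searchsorted(np.sort(x), pooled[:-1], side='right')
        num = (N * mij - j * len(x)).astype(float)**2
        a2 += (num / (j * (N - j))).sum() / len(x)
    return a2 / N

def anderson_ksamp_permutation(samples, n_rep=10000, seed=12345):
    # permutation p-value: reshuffle the pooled data into groups of
    # the original sizes and count statistics at least as large
    sizes = np.array([len(x) for x in samples])
    a2_obs = anderson_ksamp_stat(samples)
    pooled = np.concatenate(samples)
    rng = np.random.RandomState(seed)
    count = 0
    for _ in xrange(n_rep):
        rng.shuffle(pooled)
        pieces = np.split(pooled, sizes.cumsum()[:-1])
        count += (anderson_ksamp_stat(pieces) >= a2_obs)
    return a2_obs, count / float(n_rep)

print anderson_ksamp_permutation([np.arange(1., 26),
                                  np.arange(10, 31.) - 0.5])

(If I remember correctly, the null expectation of the statistic is
k - 1, which gives a quick sanity check.)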
Josef
>
> ---------------------------
> # -*- coding: utf-8 -*-
> """Comparing pvalues for ks_2samp in small sample, permutation test
>
> Created on Thu Nov 01 22:42:30 2012
> Author: Josef Perktold
>
> Results:
>
> sample: 1
> permutation 0.07704
> scipy 0.0789934269613
> R exact 0.07608
> R asymp 0.1038
>
> subset results
> [ 0.0774 0.0783 0.0732 0.0772 0.0775 0.0755 0.0791 0.0743 0.0816
> 0.0763]
>>>> execfile('permutations_ks1samp.py')
>
> sample: 0
> permutation 0.00791
> scipy 0.00378135405937
> R exact 0.007937
> R asymp 0.01348
>
> subset results
> [ 0.008 0.0094 0.0074 0.0071 0.0071 0.0084 0.0073 0.0084 0.0083
> 0.0077]
>
> """
>
> import numpy as np
> from scipy import stats
>
> sample = 0
>
> if sample == 0:
>     x1, y1 = np.arange(1.,6), np.arange(11,16.)
> else:
>     x1, y1 = np.arange(1.,26), np.arange(10,31.)-0.5
> nx = len(x1)
>
> D, pval = stats.ks_2samp(x1, y1)
>
> #permutation p-values
> n_rep = 100000 #make it large
> results = np.empty((n_rep, 2))
> results.fill(np.nan)
>
> x, y = x1.copy(), y1.copy()
> xy = np.concatenate((x,y))
> np.random.seed(2345678)
> for ii in xrange(n_rep):
>     np.random.shuffle(xy)
>     results[ii] = stats.ks_2samp(xy[:nx], xy[nx:])
>
> print '\nsample:', sample
> print 'permutation', (results[:,0] >= D).mean()
> print 'scipy ', pval
> if sample == 0:
>     print 'R exact ', 0.007937
>     print 'R asymp ', 0.01348
> else:
>     print 'R exact ', 0.07608
>     print 'R asymp ', 0.1038
>
> print '\nsubset results'
> print (results[:,0].reshape(10000, -1) >= D).mean(0)
> -------------------------
>
> Josef
>
>
>>>>
>>>> Josef
>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards,
>>>>> Peng