[SciPy-User] ks_2samp is not giving the same results as ks.test in R

josef.pktd at gmail.com josef.pktd at gmail.com
Thu Nov 1 21:41:11 EDT 2012


On Thu, Nov 1, 2012 at 9:14 PM,  <josef.pktd at gmail.com> wrote:
> On Thu, Nov 1, 2012 at 8:28 PM, Peng Yu <pengyu.ut at gmail.com> wrote:
>> Hi,
>>
>> The ks_2samp function does not give the same answer as ks.test in R.
>> Does anybody know why they are different? Does ks_2samp compute
>> something different?
>>
>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ Rscript main.R
>>> ks.test(1:5, 11:15)
>>
>>         Two-sample Kolmogorov-Smirnov test
>>
>> data:  1:5 and 11:15
>> D = 1, p-value = 0.007937
>> alternative hypothesis: two-sided
>>
>>> ks.test(1:5, 11:15, alternative='less')
>>
>>         Two-sample Kolmogorov-Smirnov test
>>
>> data:  1:5 and 11:15
>> D^- = 0, p-value = 1
>> alternative hypothesis: the CDF of x lies below that of y
>>
>>> ks.test(1:5, 11:15, alternative='greater')
>>
>>         Two-sample Kolmogorov-Smirnov test
>>
>> data:  1:5 and 11:15
>> D^+ = 1, p-value = 0.006738
>> alternative hypothesis: the CDF of x lies above that of y
>>
>>>
>>>
>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ ./main.py
>> (1.0, 0.0037813540593701006)
>> helium:~/linux/test/python/man/library/scipy/stats/ks_2samp$ cat main.py
>> #!/usr/bin/env python
>>
>> from scipy.stats import ks_2samp
>> print(ks_2samp([1,2,3,4,5], [11,12,13,14,15]))
>
> By default, R uses an "exact" distribution for small samples if there
> are no ties.
> With ties or with a large sample, R uses the asymptotic distribution.
>
> If I read the function correctly, then scipy.stats is using a small
> sample approximation by Stephens. (But I would have to look up the
> formula to verify this.)
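For samples this small, the exact two-sided p-value can be checked by brute force. A minimal sketch, assuming no ties: enumerate every way of splitting the pooled data into two groups of the original sizes and count the splits whose D is at least as large as the observed one. The helper names are made up for illustration, and this is not the recursion R itself uses; it is only feasible for tiny samples.

```python
from bisect import bisect_right
from itertools import combinations


def ks_statistic(x, y):
    """Two-sided two-sample KS statistic: max |F_x(t) - F_y(t)| over the data points."""
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(bisect_right(xs, t) / len(xs) - bisect_right(ys, t) / len(ys))
        for t in xs + ys
    )


def exact_pvalue_by_enumeration(x, y):
    """Fraction of all splits of the pooled data whose D is >= the observed D.

    Illustrative helper (hypothetical name). Assumes no ties; the number of
    splits is C(n1 + n2, n1), so keep the samples very small.
    """
    pooled = list(x) + list(y)
    n1 = len(x)
    d_obs = ks_statistic(x, y)
    count = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        a = [pooled[i] for i in idx]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if ks_statistic(a, b) >= d_obs - 1e-12:  # small tolerance for float rounding
            count += 1
    return d_obs, count / total


d, p = exact_pvalue_by_enumeration([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print(d, p)
```

For 1:5 versus 11:15 only 2 of the 252 splits reach D = 1, so p = 2/252, about 0.007937, which agrees with the ks.test output quoted above.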

http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test
has the weighted sample size: en = np.sqrt(n1*n2/float(n1+n2)).
The small-sample weighting, (en + 0.12 + 0.11/en)*d, is the same as in
Stephens (1970, 1985?) for the one-sample test.
I don't have a reference for the two-sample approximation right now.
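
To make the formula concrete, here is a minimal pure-Python sketch of that approximation. The function names are hypothetical, and the Kolmogorov survival function is written out as its standard alternating series instead of calling scipy, so this illustrates the formula and is not the scipy source.

```python
import math
from bisect import bisect_right


def kolmogorov_sf(z, terms=100):
    """Survival function of the limiting Kolmogorov distribution (series form)."""
    if z < 0.2:
        # Below 0.2 the survival function differs from 1 by less than 1e-12,
        # and the alternating series converges slowly, so short-circuit.
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * z * z)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


def ks_2samp_approx(x, y):
    """Two-sided D and the small-sample-weighted approximate p-value."""
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    # D = max |F_x(t) - F_y(t)|, evaluated at every observed data point
    d = max(abs(bisect_right(xs, t) / n1 - bisect_right(ys, t) / n2)
            for t in xs + ys)
    en = math.sqrt(n1 * n2 / (n1 + n2))  # weighted sample size
    return d, kolmogorov_sf((en + 0.12 + 0.11 / en) * d)


print(ks_2samp_approx([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]))
print(ks_2samp_approx(range(1, 26), [v - 0.5 for v in range(10, 31)]))
```

Under these assumptions the two calls give p-values of roughly 0.00378 and 0.0790, in line with the ks_2samp outputs quoted in this thread.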

(another bit of random information)
Tables are often only available for significance levels from 0.01 to 0.25,
and approximations are targeted at that range and might not be as accurate
outside of it.

Josef


>
> In the example below, with a slightly larger sample and no ties, our
> approximation (0.0790) is closer to R's "exact" p-value (0.0761) than
> R's asymptotic p-value with exact=FALSE (0.1038) is.
>
>>  ks.test(1:25, (10:30)-0.5, exact=FALSE)
>
>         Two-sample Kolmogorov-Smirnov test
>
> data:  1:25 and (10:30) - 0.5
> D = 0.36, p-value = 0.1038
> alternative hypothesis: two-sided
>
>>  ks.test(1:25, (10:30)-0.5, exact=TRUE)
>
>         Two-sample Kolmogorov-Smirnov test
>
> data:  1:25 and (10:30) - 0.5
> D = 0.36, p-value = 0.07608
> alternative hypothesis: two-sided
>
>
>>>> stats.ks_2samp(np.arange(1.,26), np.arange(10,31.)-0.5)
> (0.35999999999999999, 0.078993426961291274)
>
>
> For the 1 sample kstest I used (when I rewrote stats.kstest) an
> approximation that is closer to the exact distribution than the
> asymptotic distribution, but it's also not exact.
>
> It would be good to have better small sample approximations or exact
> distributions, but I worked on this in scipy.stats when I barely had
> any idea about goodness-of-fit tests.
> Also, ks_2samp never got the enhancement for one-sided alternatives.
> (In statsmodels I have been working so far only on one sample tests,
> but not on two-sample tests.)
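
For reference, the one-sided statistics are easy to compute by hand. A minimal sketch, assuming the usual definitions D+ = max(F_x - F_y) and D- = max(F_y - F_x) and the standard one-sided asymptotic tail exp(-2 * n1*n2/(n1+n2) * D^2); the function names are hypothetical and none of this exists in ks_2samp.

```python
import math
from bisect import bisect_right


def ks_one_sided(x, y):
    """Return (D+, D-) with D+ = max(F_x - F_y) and D- = max(F_y - F_x)."""
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    diffs = [bisect_right(xs, t) / n1 - bisect_right(ys, t) / n2
             for t in xs + ys]
    d_plus = max(0.0, max(diffs))
    d_minus = max(0.0, -min(diffs))
    return d_plus, d_minus


def one_sided_asymptotic_pvalue(d, n1, n2):
    """Asymptotic one-sided p-value: exp(-2 * n1*n2/(n1+n2) * d**2)."""
    return math.exp(-2.0 * n1 * n2 / (n1 + n2) * d * d)


x, y = [1, 2, 3, 4, 5], [11, 12, 13, 14, 15]
d_plus, d_minus = ks_one_sided(x, y)
print(d_plus, one_sided_asymptotic_pvalue(d_plus, len(x), len(y)))
print(d_minus, one_sided_asymptotic_pvalue(d_minus, len(x), len(y)))
```

For 1:5 versus 11:15 this gives D+ = 1 with p = exp(-5), about 0.006738, and D- = 0 with p = 1, the same numbers as the one-sided ks.test output in the original question.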
>
> (I don't remember if there is a minimum size recommendation, but the
> examples I usually checked were larger.)

matlab help: http://www.mathworks.com/help/stats/kstest2.html
"The asymptotic p value becomes very accurate for large sample sizes,
and is believed to be reasonably accurate for sample sizes n1 and n2
such that (n1*n2)/(n1 + n2) >= 4."
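
Applying that rule of thumb to the two examples in this thread is simple arithmetic; a small sketch, with a hypothetical helper name:

```python
def effective_n(n1, n2):
    """The quantity (n1*n2)/(n1 + n2) from the quoted rule of thumb."""
    return n1 * n2 / (n1 + n2)


for n1, n2 in [(5, 5), (25, 21)]:
    ne = effective_n(n1, n2)
    print(n1, n2, round(ne, 2), ne >= 4)
```

The 5-versus-5 example gives 2.5, which is below the threshold, while the 25-versus-21 example gives about 11.41.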
>
>
> Since it's a community project, pull requests are welcome.
>
> Josef
>
>>
>>
>> --
>> Regards,
>> Peng


