new Kolmogorov-Smirnov 2-sample test not Numerical Recipes based
On Wed, Dec 3, 2008 at 3:15 PM, <josef.pktd@gmail.com> wrote:
On Wed, Dec 3, 2008 at 2:49 PM, Jarrod Millman <millman@berkeley.edu> wrote:
On Wed, Dec 3, 2008 at 11:43 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
def ks_2samp(data1, data2):
    """ Computes the Kolmogorov-Smirnof statistic on 2 samples.
    Modified from Numerical Recipies in C, page 493.
    Returns KS D-value, prob.  Not ufunc-like.
Wait - really? We can't use Numerical Recipes code, it has strict and incompatible licensing... If it's in there it really has to come out as fast as possible.
I did a 2-sample kstest based on the definition, using searchsorted; see the function below. Attached is a script file with standalone old and new versions and a Monte Carlo evaluation. I didn't do any speed comparison.

The function works only for one-dimensional data; the old one doesn't handle more dimensions either, although there is some initial setup for more than one dimension in the old version that I don't understand.

Main reference used (after googling):
http://math.ucsd.edu/~gptesler/283/kolmogorov_smirnov_05.pdf

If there is any interest, I can replace the existing implementation with the new function.

Josef

Conclusion: comparing the old implementation based on Numerical Recipes with the new implementation
* if the 2 samples have the same size, the difference in the KS statistic is less than 1e-14
* if the sample sizes differ:
  * in an n1=2, n2=3 example the new version is correct, the old version is wrong
  * Monte Carlo gives essentially identical results (for sample sizes between 100 and 1000, rejection rates for alpha = 1%, 5%, 10% sometimes differ in the 3rd decimal); a rough sketch of such a rejection-rate check is shown after the function below
>>> data1 = np.array([1.0, 2.0])
>>> data2 = np.array([1.0, 2.0, 3.0])
>>> ks_2samp_new(data1+0.01, data2)
(0.33333333333333337, 0.99062316386915694)
>>> ks_2samp(data1+0.01, data2)
(0.33333333333333331, 0.99062316386915694)
>>> ks_2samp_new(data1-0.01, data2)
(0.66666666666666674, 0.42490954988801982)
>>> ks_2samp(data1-0.01, data2)   # result not correct
(-0.5, 0.77962787254643151)
===============
import numpy as np
from scipy.special import kolmogorov as ksprob  # survival function of the asymptotic Kolmogorov distribution

def ks_2samp_new(data1, data2):
    """ Computes the Kolmogorov-Smirnov statistic on 2 samples.

    Returns: KS D-value, p-value
    """
    data1, data2 = map(np.asarray, (data1, data2))
    n1 = data1.shape[0]
    n2 = data2.shape[0]
    data1 = np.sort(data1)
    data2 = np.sort(data2)
    data_all = np.concatenate([data1, data2])
    # empirical CDFs of both samples, evaluated at all observed points
    cdf1 = np.searchsorted(data1, data_all, side='right') / (1.0 * n1)
    cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0 * n2)
    d = np.max(np.absolute(cdf1 - cdf2))  # Note: d is the absolute, not the signed, distance
    en = np.sqrt(n1 * n2 / float(n1 + n2))
    try:
        prob = ksprob((en + 0.12 + 0.11 / en) * d)
    except Exception:
        prob = 1.0
    return d, prob
====================
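For reference, here is a minimal sketch of the kind of Monte Carlo rejection-rate comparison described above. It is not Josef's attached script; the helper name mc_rejection_rates is made up for illustration, and it assumes the old ks_2samp and the new ks_2samp_new are both importable so they can be passed in as ks_func.

===============
import numpy as np

def mc_rejection_rates(ks_func, n1=100, n2=300, nrep=1000, seed=0):
    """Rejection rates at nominal levels when H0 (same distribution) is true."""
    rng = np.random.RandomState(seed)
    pvals = np.empty(nrep)
    for i in range(nrep):
        x = rng.standard_normal(n1)          # both samples from the same distribution
        y = rng.standard_normal(n2)
        pvals[i] = ks_func(x, y)[1]          # keep the p-value from (D, p-value)
    return [(alpha, (pvals < alpha).mean()) for alpha in (0.01, 0.05, 0.10)]

# e.g. compare: mc_rejection_rates(ks_2samp) vs mc_rejection_rates(ks_2samp_new)
====================

Under the null, both implementations should reject close to the nominal alpha, which is the comparison behind the "sometimes differ in the 3rd decimal" remark above.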
Hi,
I did a 2-sample kstest based on the definition, using searchsorted; see the function below.
[snip]
If there is any interest, I can replace the existing implementation with the new function
[snip]
* in an n1=2, n2=3 example the new version is correct, the old version is wrong
* Monte Carlo gives essentially identical results (for sample sizes between 100 and 1000, rejection rates for alpha = 1%, 5%, 10% sometimes differ in the 3rd decimal)
Given the more secure license compatibility, and given that you think the new version is correct where the old one is wrong, I think this should go in now, and into the 0.7 release. Any disagreement?

Matthew
Hi,

I forgot to ask: I took the speed-of-convergence correction for the asymptotic KS distribution from the existing implementation. With a quick look on the internet I didn't find any reference for the correction.

en = np.sqrt(n1*n2/float(n1+n2))   # same as in http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test
prob = ksprob((en+0.12+0.11/en)*d)

I'm curious why it is not just the sqrt(n) analog, i.e.

prob = ksprob(en*d)

I tried it in the Monte Carlo, and ksprob(en*d) has slightly less power than ksprob((en+0.12+0.11/en)*d).

Josef
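As a quick numerical illustration (my own sketch, not from the thread; it assumes ksprob corresponds to scipy.special.kolmogorov, the survival function of the asymptotic Kolmogorov distribution, and n1, n2, d below are arbitrary illustrative values):

===============
import numpy as np
from scipy.special import kolmogorov as ksprob  # asymptotic KS survival function

# hypothetical sample sizes and observed statistic, for illustration only
n1, n2, d = 100, 120, 0.15
en = np.sqrt(n1 * n2 / float(n1 + n2))

print(ksprob(en * d))                        # plain sqrt(n)-style scaling
print(ksprob((en + 0.12 + 0.11 / en) * d))   # small-sample corrected scaling
====================

Since (en + 0.12 + 0.11/en) > en and ksprob is decreasing, the corrected formula always gives the smaller p-value, which is consistent with the slightly higher power Josef observed in the Monte Carlo.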
On Wed, Dec 3, 2008 at 20:58, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
I did a 2-sample kstest based on the definition, using searchsorted; see the function below.
[snip]
If there is any interest, I can replace the existing implementation with the new function
[snip]
* in an n1=2, n2=3 example the new version is correct, the old version is wrong
* Monte Carlo gives essentially identical results (for sample sizes between 100 and 1000, rejection rates for alpha = 1%, 5%, 10% sometimes differ in the 3rd decimal)
Given the more secure license compatibility, and given that you think the new version is correct where the old one is wrong, I think this should go in now, and into the 0.7 release. Any disagreement?
None from me.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth."
  -- Umberto Eco
On Thu, Dec 4, 2008 at 11:58 AM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
I did a 2-sample kstest based on the definition, using searchsorted; see the function below.
[snip]
If there is any interest, I can replace the existing implementation with the new function
[snip]
* in an n1=2, n2=3 example the new version is correct, the old version is wrong
* Monte Carlo gives essentially identical results (for sample sizes between 100 and 1000, rejection rates for alpha = 1%, 5%, 10% sometimes differ in the 3rd decimal)
Given the more secure license compatibility, and given that you think the new version is correct where the old one is wrong, I think this should go in now, and into the 0.7 release. Any disagreement?
Nope, this change is fine. I won't complain this time :)

David
participants (4)
- David Cournapeau
- josef.pktd@gmail.com
- Matthew Brett
- Robert Kern