[SciPy-dev] Possible Error in Kendall's Tau (scipy.stats.stats.kendalltau)

Thu Mar 19 07:42:58 EDT 2009

On 3/19/2009 2:58 AM, josef.pktd at gmail.com wrote:

> I have a hard time following the double loop and slice indices inside
> the double loop. It would be fast and safe on memory since it runs
> through the double loop only once, but hard to read, debug and
> maintain.

We start with a table of counts. Given a pivot with index (j,i), we can 
split the table into 8 fields. When counting pairs only once, concordant 
pairs are to the lower right and disconcordant to the lower left. Pairs 
with tied-y are to the right and pairs with tied x are below.

import numpy as np
from math import sqrt

import numpy as np
from math import sqrt

def taub(table):

     def _table(j,i):
        return {
          #'above' :       table[:j,i],
          'below' :       table[j+1:,i],
          #'left' :        table[j,:i],
          'right' :       table[j,i+1:],
          #'upper-left' :  table[:j,:i],
          #'upper-right' : table[:j,i+1:],
          'lower-left' :  table[j+1:,:i],
          'lower-right' : table[j+1:,i+1:] }

     D = table.shape[0]
     C = table.shape[1]
     P = 0
     Q = 0
     Tx = 0
     Ty = 0
     for i in range(C):
         for j in range(D):

             pivot = table[j,i] # use this as pivot
             ct = _table(j,i)   # split remainder into 8 sections

             # count concordant pairs
             # -- multiply pivot with 'lower-right' and summate

             P += np.sum(ct['lower-right'] * pivot)

             # count disconcordant pairs
             # -- multiply pivot with 'lower-left' and summate

             Q += np.sum(ct['lower-left'] * pivot)

             # count pairs tied in y
             # -- multiply pivot with 'right' and summate

             Ty += np.sum(ct['right'] * pivot)

             # count pairs tied in x
             # -- multiply pivot with 'lower' and summate

             Tx += np.sum(ct['below'] * pivot)

     return float(P-Q)/(sqrt((P+Q+Tx)*(P+Q+Ty)))

It seems I did a mistake in the Fortran. I counted upper-right as 
concordant where I should have counted lower-right (!#"¤#"%...):

subroutine taub(C, D, table, tau)
     implicit none
     intrinsic :: sqrt, sum
     integer*4, intent(in) :: C, D
     integer*4, dimension(C,D), intent(in) :: table
     real*8, intent(out) :: tau
     integer*4 :: i, j, pivot, P, Q, Tx, Ty
     P = 0
     Q = 0
     Tx = 0
     Ty = 0
     !$omp  parallel do &
     !$omp& private(i,j,tmp) &
     !$omp& reduction(+:P,Q,Tx,Ty) &
     !$omp& shared(table,C,D)
     do i = 1,D
         do j = 1,C
             pivot = table(j,i)
             ! count concordant pairs
             if ((i .lt. D) .and. (j .lt. C)) then
                 P = P + sum(table(j+1:C,i+1:D)*pivot)
             end if
             ! count disconcordant pairs
             if ((i .gt. 1) .and. (j .lt. C)) then
                 Q = Q + sum(table(j+1:C,1:i-1)*pivot)
             end if
             ! count pairs tied in y
             if (i .lt. D) then
                 Ty = Ty + sum(table(j,i+1:D)*pivot)
             end if
             ! count pairs tied in x
             if (j .lt. C) then
                 Tx = Tx + sum(table(j+1:C,i)*pivot)
             end if
         end do
     end do
     !$omp end parallel do
     tau = real(P-Q)/sqrt(real((P+Q+Tx)*(P+Q+Ty)))
end subroutine

> def kendalltau_fromct(x, y, count):
>     '''return tau-a, tau-b and tau-c from contingency table data
> 
>     example for contingency table
>     x = np.array([1,1,1,2,2,2])
>     y = np.array([1,2,3,1,2,3])
>     count = np.array([10, 5, 2, 9, 12, 16])
>     '''

It looks good, but it can be terribly slow. E.g. when I run it with say 
an contigency table of 100 categories for X and 100 categoried for Y, it 
takes for ever and finally raises a MemoryError. Whereas my version with 
loops takes just 2 seconds. On the other hand, if  we go furter up to a 
100 x 1000 ct, the looped version takes 90 seconds. Yours exits 
immediately with 'ValueError: broadcast dimensions too large.'

An other thing is that the input is a bit difficult. Perhaps we could 
have flattened count and calculated x and y inside the function? Then it 
would just take a 2D array of counts as input.

def kendalltau_fromct(count):
     '''return tau-a, tau-b and tau-c from contingency table data

     example for contingency table

     count = np.array([[10,  5,  2],
                       [ 9, 12, 16]])

     '''

     assert(count.ndim == 2)
     ny = count.shape[0]
     nx = count.shape[1]
     x, y = np.meshgrid(range(nx),range(ny))
     x = x.flatten()
     y = y.flatten()
     count = count.flatten()

     catx = np.unique(x)
     caty = np.unique(y)
     ncatx = len(catx)
     ncaty = len(caty)

     deltax = np.sign(x[:,np.newaxis] - x)
     deltay = np.sign(y[:,np.newaxis] - y)

     paircount = count[:,np.newaxis]*count - np.diag(count)
     # number of concordant minus number of discordant pairs
     netconc = np.sum((deltax*deltay)*paircount)

     # calculation for tau-a
     tau_a = netconc/(1.*paircount.sum())

     # calculation for tau-c
     m = min(ncatx,ncaty)
     tau_c = netconc /(1.*count.sum()**2)* m/(m-1.0)

     #extra calculation for tau_b
     # row and column counts of contingency table
     countx = np.dot(count,(x[:,np.newaxis]==catx))
     county = np.dot(count,(y[:,np.newaxis]==caty))
     #total number of pairs
     npairs = paircount.sum()
     #number of ties
     ntiex = np.dot(countx,(countx-1))
     ntiey = np.dot(county,(county-1))

     denom = 1.0*np.sqrt((npairs - ntiex ) * (npairs - ntiey))
     tau_b = netconc / denom

     return tau_a, tau_b, tau_c

Anyhow, I think Fortran or Cython is really needed here.

Sturla