[SciPy-dev] Possible Error in Kendall's Tau (scipy.stats.stats.kendalltau)
Sturla Molden
sturla at molden.no
Thu Mar 19 07:42:58 EDT 2009
On 3/19/2009 2:58 AM, josef.pktd at gmail.com wrote:
> I have a hard time following the double loop and slice indices inside
> the double loop. It would be fast and safe on memory since it runs
> through the double loop only once, but hard to read, debug and
> maintain.
We start with a table of counts. Given a pivot with index (j,i), we can
split the table into 8 fields. When counting pairs only once, concordant
pairs are to the lower right and disconcordant to the lower left. Pairs
with tied-y are to the right and pairs with tied x are below.
import numpy as np
from math import sqrt
import numpy as np
from math import sqrt
def taub(table):
def _table(j,i):
return {
#'above' : table[:j,i],
'below' : table[j+1:,i],
#'left' : table[j,:i],
'right' : table[j,i+1:],
#'upper-left' : table[:j,:i],
#'upper-right' : table[:j,i+1:],
'lower-left' : table[j+1:,:i],
'lower-right' : table[j+1:,i+1:] }
D = table.shape[0]
C = table.shape[1]
P = 0
Q = 0
Tx = 0
Ty = 0
for i in range(C):
for j in range(D):
pivot = table[j,i] # use this as pivot
ct = _table(j,i) # split remainder into 8 sections
# count concordant pairs
# -- multiply pivot with 'lower-right' and summate
P += np.sum(ct['lower-right'] * pivot)
# count disconcordant pairs
# -- multiply pivot with 'lower-left' and summate
Q += np.sum(ct['lower-left'] * pivot)
# count pairs tied in y
# -- multiply pivot with 'right' and summate
Ty += np.sum(ct['right'] * pivot)
# count pairs tied in x
# -- multiply pivot with 'lower' and summate
Tx += np.sum(ct['below'] * pivot)
return float(P-Q)/(sqrt((P+Q+Tx)*(P+Q+Ty)))
It seems I did a mistake in the Fortran. I counted upper-right as
concordant where I should have counted lower-right (!#"¤#"%...):
subroutine taub(C, D, table, tau)
implicit none
intrinsic :: sqrt, sum
integer*4, intent(in) :: C, D
integer*4, dimension(C,D), intent(in) :: table
real*8, intent(out) :: tau
integer*4 :: i, j, pivot, P, Q, Tx, Ty
P = 0
Q = 0
Tx = 0
Ty = 0
!$omp parallel do &
!$omp& private(i,j,tmp) &
!$omp& reduction(+:P,Q,Tx,Ty) &
!$omp& shared(table,C,D)
do i = 1,D
do j = 1,C
pivot = table(j,i)
! count concordant pairs
if ((i .lt. D) .and. (j .lt. C)) then
P = P + sum(table(j+1:C,i+1:D)*pivot)
end if
! count disconcordant pairs
if ((i .gt. 1) .and. (j .lt. C)) then
Q = Q + sum(table(j+1:C,1:i-1)*pivot)
end if
! count pairs tied in y
if (i .lt. D) then
Ty = Ty + sum(table(j,i+1:D)*pivot)
end if
! count pairs tied in x
if (j .lt. C) then
Tx = Tx + sum(table(j+1:C,i)*pivot)
end if
end do
end do
!$omp end parallel do
tau = real(P-Q)/sqrt(real((P+Q+Tx)*(P+Q+Ty)))
end subroutine
> def kendalltau_fromct(x, y, count):
> '''return tau-a, tau-b and tau-c from contingency table data
>
> example for contingency table
> x = np.array([1,1,1,2,2,2])
> y = np.array([1,2,3,1,2,3])
> count = np.array([10, 5, 2, 9, 12, 16])
> '''
It looks good, but it can be terribly slow. E.g. when I run it with say
an contigency table of 100 categories for X and 100 categoried for Y, it
takes for ever and finally raises a MemoryError. Whereas my version with
loops takes just 2 seconds. On the other hand, if we go furter up to a
100 x 1000 ct, the looped version takes 90 seconds. Yours exits
immediately with 'ValueError: broadcast dimensions too large.'
An other thing is that the input is a bit difficult. Perhaps we could
have flattened count and calculated x and y inside the function? Then it
would just take a 2D array of counts as input.
def kendalltau_fromct(count):
'''return tau-a, tau-b and tau-c from contingency table data
example for contingency table
count = np.array([[10, 5, 2],
[ 9, 12, 16]])
'''
assert(count.ndim == 2)
ny = count.shape[0]
nx = count.shape[1]
x, y = np.meshgrid(range(nx),range(ny))
x = x.flatten()
y = y.flatten()
count = count.flatten()
catx = np.unique(x)
caty = np.unique(y)
ncatx = len(catx)
ncaty = len(caty)
deltax = np.sign(x[:,np.newaxis] - x)
deltay = np.sign(y[:,np.newaxis] - y)
paircount = count[:,np.newaxis]*count - np.diag(count)
# number of concordant minus number of discordant pairs
netconc = np.sum((deltax*deltay)*paircount)
# calculation for tau-a
tau_a = netconc/(1.*paircount.sum())
# calculation for tau-c
m = min(ncatx,ncaty)
tau_c = netconc /(1.*count.sum()**2)* m/(m-1.0)
#extra calculation for tau_b
# row and column counts of contingency table
countx = np.dot(count,(x[:,np.newaxis]==catx))
county = np.dot(count,(y[:,np.newaxis]==caty))
#total number of pairs
npairs = paircount.sum()
#number of ties
ntiex = np.dot(countx,(countx-1))
ntiey = np.dot(county,(county-1))
denom = 1.0*np.sqrt((npairs - ntiex ) * (npairs - ntiey))
tau_b = netconc / denom
return tau_a, tau_b, tau_c
Anyhow, I think Fortran or Cython is really needed here.
Sturla
More information about the SciPy-Dev
mailing list