[SciPy-dev] Possible Error in Kendall's Tau (scipy.stats.stats.kendalltau)

josef.pktd at gmail.com
Wed Mar 18 16:13:11 EDT 2009


On Wed, Mar 18, 2009 at 1:53 PM, Sturla Molden <sturla at molden.no> wrote:
> On 3/18/2009 6:16 PM, josef.pktd at gmail.com wrote:
>
>> You could still have O(C*D) > O(N**2), if the table is sparse, and you
>> haven't deleted the empty rows and columns.
>
> Yes. So which option is faster depends on N and the number of ordinal
> categories. But often we have C*D << N**2. If N is a million and 100
> categories suffice, it is easy to do the math.
>
> Also, it is possible to estimate tau by Monte Carlo.
>

I got a (number of cells)**2 version of Kendall's tau-a for the
contingency table; that's as far as I could go without loops. I don't
see how you could get O(C*D) rather than O((C*D)**2), since you still
need to compare all pairs of cells, so my impression is that the
relevant comparison is between C*D and N.

Josef

import numpy as np

# contingency table
violence = np.array([1,1,1,2,2,2])
rating = np.array([1,2,3,1,2,3])
count = np.array([10, 5, 2, 9, 12, 16])

# individual observations
vi = np.repeat(violence, count)
ra = np.repeat(rating, count)

# tau-a calculated using the contingency table from the example;
# creates arrays of size (number of cells)**2, no loops but (almost
# 50%) redundant points

deltax = violence[:,np.newaxis] - violence   # row score differences for all cell pairs
deltay = rating[:,np.newaxis] - rating       # column score differences for all cell pairs
# ordered pairs of distinct observations for each pair of cells
paircount = count[:,np.newaxis]*count - np.diag(count)
tau_a = np.sum(np.sign(deltax*deltay)*paircount)/(1.*paircount.sum())
# kendalltaustata: tau-a computed from the individual observations (not shown here)
print tau_a, kendalltaustata(vi, ra)
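
For reference, a direct O(N**2) pairwise version on the expanded
observations could look roughly like this (just a cross-check of the
table version above, not necessarily how kendalltaustata is
implemented):

# direct pairwise tau-a on the individual observations; builds N x N arrays
dx = vi[:, np.newaxis] - vi
dy = ra[:, np.newaxis] - ra
nobs = len(vi)
# nobs*(nobs-1) ordered pairs of distinct observations; the diagonal contributes 0
tau_a_pairs = np.sum(np.sign(dx * dy)) / (1. * nobs * (nobs - 1))
print tau_a_pairs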
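
And the Monte Carlo estimate mentioned above could be sketched by
sampling random pairs of observations instead of enumerating all of
them (the number of sampled pairs here is arbitrary):

# Monte Carlo estimate of tau-a: average sign over randomly sampled pairs
npairs = 100000
idx1 = np.random.randint(0, len(vi), npairs)
idx2 = np.random.randint(0, len(vi), npairs)
keep = idx1 != idx2   # drop pairs of an observation with itself
s = np.sign((vi[idx1] - vi[idx2]) * (ra[idx1] - ra[idx2]))
tau_a_mc = s[keep].mean()
print tau_a_mc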


