Fast (O(n log(n)) ) implementation of Kendall Tau
Dear all, A few days ago I posted to http://projects.scipy.org/scipy/ticket/999 a drop-in replacement for scipy.stats.kendalltau. My code implements the algorithm with complexity O(n log()) described by William R. Knight in a paper of 1966 archived at http://www.jstor.org/pss/2282833 , whereas the function currently in SciPy has complexity O(n^2), which makes it unusable with large data sets. One potential issue here is that my code is in part derived from a Java implementation contained in the Java package called "Law" (http://law.dsi.unimi.it/software/law-1.4-src.tar.gz ). That code is GPL'd, but its authors (http://law.dsi.unimi.it/index.php?option=com_contact&catid=4&Itemid=3 ) have already informally confirmed to me by e-mail that they have no objections to the release of my Python code under a BSD-style license. Now my question is: exactly which steps should I take in order to clear the path to a possible inclusion in SciPy? All the best, Enzo Michelangeli
2009/9/30 Enzo Michelangeli <enzomich@gmail.com>: A few days ago I posted to http://projects.scipy.org/scipy/ticket/999 a drop-in replacement for scipy.stats.kendalltau.
It was a good idea to post this note on the mailing list. I don't think Trac is sending out notifications of new tickets, so it's likely that no-one has spotted your contribution yet. Cheers, Scott
Enzo Michelangeli skrev:
Dear all,
A few days ago I posted to http://projects.scipy.org/scipy/ticket/999 a drop-in replacement for scipy.stats.kendalltau. My code implements the algorithm with complexity O(n log()) described by William R. Knight in a paper of 1966 archived at http://www.jstor.org/pss/2282833 , whereas the function currently in SciPy has complexity O(n^2), which makes it unusable
There is also: http://projects.scipy.org/scipy/ticket/893 It has a contigency table version that would be fast for large data sets, in theory O(n). Sturla Molden
From: "Sturla Molden" <sturla@molden.no> Sent: Thursday, October 01, 2009 6:03 AM
Enzo Michelangeli skrev:
Dear all,
A few days ago I posted to http://projects.scipy.org/scipy/ticket/999 a drop-in replacement for scipy.stats.kendalltau. My code implements the algorithm with complexity O(n log()) described by William R. Knight in a paper of 1966 archived at http://www.jstor.org/pss/2282833 , whereas the function currently in SciPy has complexity O(n^2), which makes it unusable
There is also:
http://projects.scipy.org/scipy/ticket/893
It has a contigency table version that would be fast for large data sets, in theory O(n).
Yes, but that requires a native module. Enzo
Enzo Michelangeli skrev:
Dear all,
A few days ago I posted to http://projects.scipy.org/scipy/ticket/999 a drop-in replacement for scipy.stats.kendalltau. Why do you re-implement mergesort in pure Python?
ndarrays have a sort method that can use mergesort. Python lists has the same (timsort is mergesort on steroids).
From: "Sturla Molden" <sturla@molden.no> Sent: Thursday, October 01, 2009 6:09 AM
Enzo Michelangeli skrev:
Dear all,
A few days ago I posted to http://projects.scipy.org/scipy/ticket/999 a drop-in replacement for scipy.stats.kendalltau. Why do you re-implement mergesort in pure Python?
ndarrays have a sort method that can use mergesort.
Python lists has the same (timsort is mergesort on steroids).
Because Knight's algorithm needs to count the number of swaps (or, to be more precise, the number of swaps that would be performed by an equivalent bubblesort). In the code, that's the purpose of the variable exchcnt . An alternative algorithm for the Kendall Tau developed by David Christensen, and described at http://www.springerlink.com/content/p33qu44058984082/ (with a Delphi implementation), replaces the mergesort step with one based on balanced binary trees (AVL in his case, but I guess RBT would also work). Unfortunately, neither the standard Python library nor NumPy/SciPy appear to implement such useful data structures, and what is available either doesn't allow O(log(n)) random access (heapq) or lacks a O(log(n)) insert (sorted lists accessed through bisect). Enzo
On Wed, Sep 30, 2009 at 10:07 PM, Enzo Michelangeli <enzomich@gmail.com>wrote:
From: "Sturla Molden" <sturla@molden.no> Sent: Thursday, October 01, 2009 6:09 AM
Enzo Michelangeli skrev:
Dear all,
A few days ago I posted to http://projects.scipy.org/scipy/ticket/999 a drop-in replacement for scipy.stats.kendalltau. Why do you re-implement mergesort in pure Python?
ndarrays have a sort method that can use mergesort.
Python lists has the same (timsort is mergesort on steroids).
Because Knight's algorithm needs to count the number of swaps (or, to be more precise, the number of swaps that would be performed by an equivalent bubblesort). In the code, that's the purpose of the variable exchcnt .
An alternative algorithm for the Kendall Tau developed by David Christensen, and described at http://www.springerlink.com/content/p33qu44058984082/(with a Delphi implementation), replaces the mergesort step with one based on balanced binary trees (AVL in his case, but I guess RBT would also work). Unfortunately, neither the standard Python library nor NumPy/SciPy appear to implement such useful data structures, and what is available either doesn't
Yep, we could use some more of those useful data structures. A "computer science" library would be useful. The scipy.spacial library is step in that direction but it would be nice if there were more such. Chuck
participants (4)
-
Charles R Harris
-
Enzo Michelangeli
-
Scott Sinclair
-
Sturla Molden