Re: [SciPy-Dev] GSoC Draft Proposal: Rewrite and improve cluster package in Cython

March 18, 2014


      On Tue, Mar 18, 2014 at 10:03 AM, Richard Tsai <richard9404@gmail.com>wrote:
...
2014-03-18 5:29 GMT+08:00 Ralf Gommers <ralf.gommers@gmail.com>:
...
On Mon, Mar 17, 2014 at 9:31 AM, Richard Tsai <richard9404@gmail.com>wrote:
...
Hi,
I looked at several ML packages and I found that ELKI has implemented a
optimized single linkage algorithm called SLINK[1][2]. And I also found a
similar algorithm called CLINK[3], which is for complete linkage. It seems
that these two algorithms are much faster and use less memory than the
naive algorithms we are using in cluster.hierarchy currently.
Looks like ELKI as a mix of licenses including BSD, but the default is
AGPL. Did you check for this algorithm? Not entirely clear from the above
if you planned to integrate or reimplement it, for the former it matters.
The SLINK java implementation in ELKI should be under AGPL but these two
algorithms were published decades ago and should be not patented.
...
...
I also read some IPython notebooks and StackOverflow posts recently and
I found that many people are discussing how to plot a heatmap of
hierarchical clustering. I think if we integrate it into cluster.hierarchy,
it will be a good complement to hierarchy.dendrogram.
Assuming that it doesn't take too much time to implement (plotting
shouldn't be a focus), that sounds fine. There are some more functions in
several packages that optionally use MPL.
...
Besides, I noticed that the cluster package is single-threaded
currently. I don't know if parallelization in scipy level rather than BLAS
level is proper,
That has been out of scope for scipy until now.
...
but at least we can just make use of the BLAS library (if it supports)
to parallelize the kmeans algorithm.
Are you talking about the for-loop over observations in vq()? I don't see
any linalg going on in kmeans.
We can expand the formula when calculating the distances then make use of
some BLAS functions. I've just written a demo of it:
https://gist.github.com/richardtsai/9614846 (written casually, a bit
messy)
It runs about 16x faster than the original C version in a 100000x500
dataset with k = 30. (Built with a thread-enabled ATLAS)
In [30]: X.shape
Out[30]: (100000, 500)
In [31]: c.shape
Out[31]: (30, 500)
In [32]: %timeit _vq.vq(X, c)
1 loops, best of 3: 1.28 s per loop
In [33]: %timeit _vq_rewrite.vq(X, c)
10 loops, best of 3: 79.6 ms per loop
OK clear. That's a pretty decent speed-up:)

Ralf
...
...
Ralf
...
[1]:
http://elki.dbs.ifi.lmu.de/releases/release0.6.0/doc/de/lmu/ifi/dbs/elki/alg...
[2]: http://www.cs.ucsb.edu/~veronika/MAE/SLINK_sibson.pdf
[3]: http://comjnl.oxfordjournals.org/content/20/4/364.abstract
Regards,
Richard
_______________________________________________
SciPy-Dev mailing list
SciPy-Dev@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-dev