[SciPy-Dev] GSoC Draft Proposal: Rewrite and improve cluster package in Cython

March 13, 2014

Hi all,
I have written a draft GSoC proposal about the cluster package and am posting
it to the list for advice. However, as Ralf said, cluster is not well
maintained at the moment, and I have not yet been able to find someone who
knows cluster analysis to mentor me. If you have any suggestions for my
proposal, or are willing to mentor me, please let me know. I would be really
grateful.

Regards,
Richard

Proposal Title: SciPy: Rewrite and improve cluster package in Cython

Proposal Abstract

According to the roadmap to SciPy 1.0, the cluster package needs a Cython
rewrite to make it more maintainable and efficient. Besides, there is room
for improvement in the cluster.vq module: some useful features can be added,
and performance on large datasets can be improved.

Proposal Detailed Description/Timeline

There is an experimental Cython implementation of the vq module in the
source tree. However, it has not been maintained for about 2 years, it
supports only single precision data, and it is slower than the original
implementation.

I plan to start with some cleanup, then finish the double precision
support. After some optimization and tuning, it should be mature enough to
replace the original implementation.
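
To illustrate what supporting both precisions means for testing, here is a
minimal NumPy sketch of a brute-force reference quantizer that a Cython
version could be checked against. The name vq_reference is hypothetical and
is not part of scipy.cluster.vq; the rule of staying in single precision
only when both inputs are float32 is an assumption made for this example.

```python
import numpy as np


def vq_reference(obs, code_book):
    """Assign each observation to its nearest code (brute force).

    Hypothetical helper for illustration; not a SciPy function.
    """
    obs = np.asarray(obs)
    code_book = np.asarray(code_book)
    # Assumption: keep single precision only if both inputs are float32,
    # otherwise compute in double precision.
    both_single = obs.dtype == np.float32 and code_book.dtype == np.float32
    dtype = np.float32 if both_single else np.float64
    obs = obs.astype(dtype, copy=False)
    code_book = code_book.astype(dtype, copy=False)
    # Pairwise Euclidean distances, shape (n_obs, n_codes).
    diff = obs[:, None, :] - code_book[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    code = dist.argmin(axis=1)
    return code, dist[np.arange(obs.shape[0]), code]
```

Running the same inputs through this function as float32 and as float64
gives a simple way to compare the two code paths in unit tests.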

After that, I am going to implement a mini-batch optimization for the
kmeans/kmeans2 functions based on the paper "Web-Scale K-Means
Clustering" <http://dl.acm.org/citation.cfm?id=1772862>,
which should greatly improve performance on large datasets. In
addition, I think support for automatically determining the number of
clusters via some method (e.g. the gap
statistic <http://www.stanford.edu/~hastie/Papers/gap.pdf>)
could be included in this module.
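
The mini-batch idea can be sketched in a few lines of NumPy: each iteration
draws a small random batch, assigns its points to the nearest center, and
moves each center with a per-center learning rate of 1/count. This is only
an illustrative sketch; the function name, parameters and defaults are
hypothetical and are not an existing or planned scipy.cluster.vq API.

```python
import numpy as np


def minibatch_kmeans(obs, k, batch_size=100, n_iter=100, seed=0):
    """Illustrative mini-batch k-means; returns the k centers.

    Hypothetical helper for illustration; not a SciPy function.
    """
    rng = np.random.default_rng(seed)
    obs = np.asarray(obs, dtype=np.float64)
    n = obs.shape[0]
    # Initialize the centers with k distinct observations.
    centers = obs[rng.choice(n, size=k, replace=False)].copy()
    counts = np.zeros(k, dtype=np.int64)
    for _ in range(n_iter):
        # Draw a random mini-batch (with replacement).
        batch = obs[rng.integers(0, n, size=batch_size)]
        # Cache the nearest center of every batch point before updating.
        d2 = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for x, c in zip(batch, labels):
            counts[c] += 1
            eta = 1.0 / counts[c]  # per-center learning rate
            centers[c] = (1.0 - eta) * centers[c] + eta * x
    return centers
```

Each iteration touches only batch_size points, so its cost does not depend
on the total number of observations, which is where the gain on large
datasets would come from.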

As for the hierarchy module, it is rather full-featured now, but the Cython
rewrite has not yet begun. I will rewrite the high-level part in Cython
first, since it is convenient to call the original underlying C functions
from Cython code. Finally, I will gradually migrate the underlying part from
C to Cython.

My detailed timeline is as follows.

   - Week 1: Clean up the existing experimental Cython version of vq
   (bugs, docs, etc.), add unit tests, and write performance benchmarks for
   datasets of various sizes and distributions.
   - Week 2: Finish the double precision support in the Cython version
   of vq, and try migrating some Python code to Cython to improve
   performance.
   - Week 3: Profile and continue to optimize the performance of vq, and
   try to replace the original C implementation with the new Cython
   implementation.
   - Week 4: Implement the mini-batch K-means algorithm.
   - Week 5: Add support for automatically determining the number of
   clusters.
   - Week 6: Buffer time. Finish any work that is behind schedule, and
   try some potential optimizations.
   - Week 7: Build a framework for the Cython implementation of
   the hierarchy module. The work should just be translating the wrapper
   functions in hierarchy_wrap.c into Cython, so there may be no performance
   gains at this point.
   - Weeks 8-9: Rewrite the underlying implementation of
   the hierarchy module in Cython. The major work is to
   translate hierarchy.c into Cython.
   - Week 10: Optimize the Cython implementation of the hierarchy module,
   and replace the original implementation if possible.
   - Remaining time (if any): Improve the documentation and add some sample
   code, especially for the hierarchy module.

Code Sample

My previous patches to SciPy can be found at
https://github.com/scipy/scipy/pulls/richardtsai?state=closed
I have not submitted code to the cluster package yet, but I will probably
make a related PR soon.

Richard Tsai