
Hi all,

I wrote a draft proposal for my GSoC project about the cluster package, and I am posting it to the list hoping for advice. However, as Ralf said, cluster is not well maintained now, and I have still not been able to find anyone who knows about cluster analysis to mentor me. If you have any suggestions for my proposal, or are willing to mentor me, please let me know and I will be really grateful.

Regards,
Richard

Proposal Title: SciPy: Rewrite and improve cluster package in Cython

Proposal Abstract

According to the roadmap to SciPy 1.0, the cluster package needs a Cython rewrite to make it more maintainable and efficient. Besides, there is room for improvement in the cluster.vq module: some useful features can be added, and the performance on large datasets can be improved.

Proposal Detailed Description/Timeline

There is an experimental Cython implementation of the vq module in the source tree. However, it has not been maintained for about 2 years, it only supports single-precision datasets, and it is slower than the original implementation. I plan to start with some cleanup, then finish the double-precision support. After some optimization and tuning it should be mature enough to replace the original implementation.

After that, I am going to implement a mini-batch optimization for the kmeans/kmeans2 functions based on a paper ("Web-Scale K-Means Clustering" <http://dl.acm.org/citation.cfm?id=1772862>), which should greatly improve the performance on large datasets. In addition, I think support for automatically determining the number of clusters via some method (e.g. the gap statistic <http://www.stanford.edu/~hastie/Papers/gap.pdf>) can be included in this module.

As for the hierarchy module, it is rather full-featured now, but the Cython rewrite has not yet begun. I'll rewrite the high-level part in Cython first, since it is convenient to call the original underlying C functions from Cython code. Finally, I'll gradually migrate the underlying part from C to Cython.

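To make the mini-batch idea mentioned above concrete, here is a minimal sketch of the update step from the "Web-Scale K-Means Clustering" paper, written in plain Python for readability. The names minibatch_kmeans and nearest_center and the parameters batch_size, n_iter and seed are illustrative placeholders, not existing scipy.cluster.vq API; an actual implementation would presumably operate on NumPy arrays and be written in Cython.

```python
import random


def nearest_center(centers, x):
    """Return the index of the center closest to x (squared Euclidean distance)."""
    best, best_dist = 0, float("inf")
    for j, c in enumerate(centers):
        dist = sum((ci - xi) ** 2 for ci, xi in zip(c, x))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def minibatch_kmeans(data, k, batch_size=10, n_iter=100, seed=0):
    """Hypothetical helper (not part of SciPy): mini-batch k-means sketch.

    data is a list of equal-length numeric sequences; returns k centers.
    """
    rng = random.Random(seed)
    # Initialise the centers with k observations sampled without replacement.
    centers = [list(x) for x in rng.sample(data, k)]
    counts = [0] * k  # per-center counts, used as the learning-rate schedule

    for _ in range(n_iter):
        # Draw a small random batch instead of scanning the whole dataset.
        batch = [rng.choice(data) for _ in range(batch_size)]
        # Cache the assignments for the whole batch before moving any center.
        labels = [nearest_center(centers, x) for x in batch]
        for x, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            # Gradient step: move center j towards x by a fraction eta.
            centers[j] = [(1.0 - eta) * c + eta * xi
                          for c, xi in zip(centers[j], x)]
    return centers


if __name__ == "__main__":
    # Two well-separated groups of 2-D points as toy input.
    points = [(i * 0.01, 0.0) for i in range(50)]
    points += [(10.0 + i * 0.01, 10.0) for i in range(50)]
    print(sorted(minibatch_kmeans(points, 2)))
```

Each iteration touches only batch_size observations, so the cost per iteration does not depend on the dataset size; the 1/count learning rate makes each center a running mean of the points assigned to it.
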
My detailed timeline is as follows.

- Week 1: Clean up the existing experimental Cython version of vq (bugs, docs, etc.), and add unit tests and performance benchmarks for datasets of various sizes and distributions.
- Week 2: Finish the double-precision support in the Cython version of vq, and try to migrate some Python code to Cython to improve performance.
- Week 3: Do some performance profiling, continue to optimize the performance of vq, and try to replace the original C implementation with the new Cython implementation.
- Week 4: Implement the mini-batch K-means algorithm.
- Week 5: Add support for automatically determining the number of clusters.
- Week 6: Buffer time. Finish any work that is behind schedule, and try some potential optimizations.
- Week 7: Build a framework for the Cython implementation of the hierarchy module. This should just be a matter of translating the wrapper functions in hierarchy_wrap.c into Cython, so there may be no performance gains at this point.
- Week 8-9: Rewrite the underlying implementation of the hierarchy module in Cython. The major work is to translate hierarchy.c into Cython.
- Week 10: Optimize the Cython implementation of the hierarchy module, and replace the original implementation if possible.
- Remaining time (if any): Improve the documentation and add some sample code, especially for the hierarchy module.

Code Sample

My previous patches to SciPy can be found at https://github.com/scipy/scipy/pulls/richardtsai?state=closed

I haven't submitted code to the cluster package yet, but I'll probably make a related PR soon.