GSoC Draft Proposal: Rewrite and improve cluster package in Cython

Hi all,

I wrote a draft proposal for my GSoC project on the cluster package, and I am posting it to the list hoping for advice. However, as Ralf said, cluster is not well maintained now, and I have still not been able to find someone who knows about cluster analysis to mentor me. If you have any suggestions for my proposal, or are willing to mentor me, please let me know and I will be really grateful.

Regards,
Richard

Proposal Title: SciPy: Rewrite and improve cluster package in Cython

Proposal Abstract

According to the roadmap to SciPy 1.0, the cluster package needs a Cython rewrite to make it more maintainable and efficient. Besides, there's room for improvement in the cluster.vq module. Some useful features can be added and the performance can be improved when dealing with large datasets.

Proposal Detailed Description/Timeline

There's an experimental Cython implementation of the vq module in the source tree. However, it has not been maintained for about 2 years, it only supports single precision datasets, and it's also slower than the original implementation. I plan to start with some cleanup, then finish the double precision support. After some optimization and tuning it should be mature enough to replace the original implementation.

After that, I'm going to implement a mini-batch optimization for the kmeans/kmeans2 functions based on a paper ("Web-Scale K-Means Clustering" <http://dl.acm.org/citation.cfm?id=1772862>), which should greatly improve the performance for large datasets. In addition, I think support for automatically determining the number of clusters via some method (e.g. gap statistics <http://www.stanford.edu/~hastie/Papers/gap.pdf>) can be included in this module.

As for the hierarchy module, it is rather full-featured now, but the Cython rewrite has not yet begun. I'll rewrite the high level part in Cython first, since it is convenient to call the original underlying C functions from Cython code. Finally, I'll gradually migrate the underlying part from C to Cython.

My detailed timeline is as follows.

- Week 1: Do some cleanup of the existing experimental Cython version of vq (bugs, docs, etc.), add unit tests and performance benchmarks for datasets of various sizes and distributions.
- Week 2: Finish the double precision support in the Cython version of vq, and try to migrate some Python code to Cython to gain performance.
- Week 3: Do some performance profiling, continue to optimize the performance of vq, and try to replace the original C implementation with the new Cython implementation.
- Week 4: Implement the mini-batch K-means algorithm.
- Week 5: Add support for automatically determining the number of clusters.
- Week 6: Maneuver time. Finish the work that is behind schedule, and try some potential optimizations.
- Week 7: Build a framework for the Cython implementation of the hierarchy module. The work should just be translating the wrapper functions in hierarchy_wrap.c into Cython, so there may be no performance gains at that point.
- Week 8-9: Rewrite the underlying implementation of the hierarchy module in Cython. The major work is to translate hierarchy.c into Cython.
- Week 10: Optimize the Cython implementation of the hierarchy module, and replace the original implementation if possible.
- Remaining time (if there is any): Improve the documentation and add some sample code, especially for the hierarchy module.

Code Sample

My previous patches to SciPy can be found at https://github.com/scipy/scipy/pulls/richardtsai?state=closed

I haven't submitted code to the cluster package yet, but I'll probably make a related PR soon.
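For reference, the mini-batch update planned for week 4 can be sketched in a few lines of NumPy, following the algorithm described in "Web-Scale K-Means Clustering". This is only an illustrative sketch, not code from the proposal or from SciPy: the function name `minibatch_kmeans` and its parameters (`batch_size`, `n_iter`, `seed`) are assumptions made for this example, and the initial centroids are simply drawn at random from the data.

```python
import numpy as np


def minibatch_kmeans(X, k, batch_size=100, n_iter=100, seed=0):
    """Illustrative mini-batch k-means (hypothetical helper, not SciPy API)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=np.float64)
    # Initialise the centroids with k distinct observations picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    counts = np.zeros(k, dtype=np.int64)  # samples seen so far per centroid

    for _ in range(n_iter):
        # Draw a small random batch instead of scanning the whole dataset.
        batch = X[rng.integers(0, len(X), size=batch_size)]
        # Cache the nearest centroid of every batch sample before updating.
        d2 = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        # Per-centroid learning rate 1 / count, i.e. a running mean of all
        # samples ever assigned to that centroid.
        for x, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centroids[j] = (1.0 - eta) * centroids[j] + eta * x
    return centroids
```

Each iteration only touches `batch_size` observations, which is where the savings on large datasets would come from; the price is that the centroids are approximate rather than a fixed point of the full Lloyd iteration.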

Hi,

FWIW, I think this is a pretty good proposal, but I worry that some of it duplicates work that's already taken place in scikit-learn.

I think that a high-performance vq module is an important thing to have in SciPy itself (though Jake Vanderplas did some work on distance computations in Cython for scikit-learn that should be leveraged if possible, maybe Jake has thoughts on factoring it into a separate package?) and to my knowledge, the hierarchy module is not duplicated to a great extent in scikit-learn. I'd thus prioritize those two things, *including* sprucing up their documentation (SciPy is a fairly mature project, and one where documentation is, ideally, not an afterthought).

Things like mini-batch k-means and automatic determination of k are interesting but more scikit-learn territory. I would leave these things to the end, on an if-there's-time basis.

Since that _vq_rewrite was written, Cython has introduced much cleaner memoryviews. Definitely prefer those over the deprecated ndarray syntax.

On Thu, Mar 13, 2014 at 11:08 AM, Richard Tsai <richard9404@gmail.com> wrote:
> Hi all, I wrote a draft proposal for my GSoC about the cluster package. [...]

_______________________________________________
SciPy-Dev mailing list
SciPy-Dev@scipy.org
http://mail.scipy.org/mailman/listinfo/scipy-dev

Hi David,

Thanks for your advice! I'll improve my proposal and pay more attention to documentation. I agree that the vq module should be kept simple but high-performance, so I'll focus on optimizing it. I'll also read up on hierarchical clustering soon and look for potential improvements to it.

Regards,
Richard

2014-03-14 15:32 GMT+08:00 David Warde-Farley <d.warde.farley@gmail.com>:
> FWIW, I think this is a pretty good proposal, but I worry that some of it duplicates work that's already taken place in scikit-learn. [...]

On the side of hierarchical clustering, I think it would be very instructive to look at existing _software packages_ for doing hierarchical clustering rather than just the research literature.

I think promoting the fact that this part of the library even exists and showing people accustomed to other tools how to use it (e.g. with IPython notebooks on the subject, demonstrating plots and analysis and so on...) would make a good complement to what you've proposed.

On Fri, Mar 14, 2014 at 9:17 AM, Richard Tsai <richard9404@gmail.com> wrote:
> Thanks for your advice! I'll improve my proposal and pay more attention to documentation. [...]

Hi,

I looked at several ML packages and I found that ELKI has implemented an optimized single linkage algorithm called SLINK[1][2]. And I also found a similar algorithm called CLINK[3], which is for complete linkage. It seems that these two algorithms are much faster and use less memory than the naive algorithms we are using in cluster.hierarchy currently.

I also read some IPython notebooks and StackOverflow posts recently and I found that many people are discussing how to plot a heatmap of hierarchical clustering. I think if we integrate it into cluster.hierarchy, it will be a good complement to hierarchy.dendrogram.

Besides, I noticed that the cluster package is single-threaded currently. I don't know if parallelization at the scipy level rather than the BLAS level is proper, but at least we can just make use of the BLAS library (if it supports multithreading) to parallelize the kmeans algorithm.

[1]: http://elki.dbs.ifi.lmu.de/releases/release0.6.0/doc/de/lmu/ifi/dbs/elki/alg...
[2]: http://www.cs.ucsb.edu/~veronika/MAE/SLINK_sibson.pdf
[3]: http://comjnl.oxfordjournals.org/content/20/4/364.abstract

Regards,
Richard
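To make the SLINK idea concrete, here is a minimal pure-Python sketch of the pointer-representation update from Sibson's paper [2]. It is written from the published description only, not taken from ELKI or from cluster.hierarchy; the function name `slink` and the `dist` callback are assumptions made for this example. Points are added one at a time, and only three length-n arrays are kept, so the full distance matrix is never stored.

```python
import math


def slink(points, dist):
    """Single linkage via SLINK (illustrative sketch, not SciPy API).

    Returns the pointer representation (pi, lam): lam[i] is the height at
    which point i stops being the last (highest-index) member of its
    cluster, and pi[i] is the last member of the cluster it then joins.
    """
    n = len(points)
    pi = [0] * n
    lam = [math.inf] * n
    m = [0.0] * n  # scratch: distances from the newly added point

    for i in range(n):
        # Start point i as a singleton that never merges (height = inf).
        pi[i] = i
        lam[i] = math.inf
        for j in range(i):
            m[j] = dist(points[i], points[j])
        # Fold point i into the existing pointer representation.
        for j in range(i):
            if lam[j] >= m[j]:
                m[pi[j]] = min(m[pi[j]], lam[j])
                lam[j] = m[j]
                pi[j] = i
            else:
                m[pi[j]] = min(m[pi[j]], m[j])
        # Relabel pointers whose target now merges no later than they do.
        for j in range(i):
            if lam[j] >= lam[pi[j]]:
                pi[j] = i
    return pi, lam
```

For example, `slink([0.0, 1.0, 3.0, 7.0], lambda a, b: abs(a - b))` gives merge heights 1, 2 and 4 in `lam`, with the last entry left at infinity. Converting `(pi, lam)` into the linkage matrix format used by hierarchy.dendrogram would be an extra step that this sketch does not attempt.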

On Mon, Mar 17, 2014 at 9:31 AM, Richard Tsai <richard9404@gmail.com> wrote:
> I looked at several ML packages and I found that ELKI has implemented an optimized single linkage algorithm called SLINK[1][2]. And I also found a similar algorithm called CLINK[3], which is for complete linkage. It seems that these two algorithms are much faster and use less memory than the naive algorithms we are using in cluster.hierarchy currently.

Looks like ELKI has a mix of licenses including BSD, but the default is AGPL. Did you check for this algorithm? It's not entirely clear from the above whether you planned to integrate or reimplement it; for the former it matters.

> I also read some IPython notebooks and StackOverflow posts recently and I found that many people are discussing how to plot a heatmap of hierarchical clustering. I think if we integrate it into cluster.hierarchy, it will be a good complement to hierarchy.dendrogram.

Assuming that it doesn't take too much time to implement (plotting shouldn't be a focus), that sounds fine. There are some more functions in several packages that optionally use MPL.

> Besides, I noticed that the cluster package is single-threaded currently. I don't know if parallelization at the scipy level rather than the BLAS level is proper,

That has been out of scope for scipy until now.

> but at least we can just make use of the BLAS library (if it supports multithreading) to parallelize the kmeans algorithm.

Are you talking about the for-loop over observations in vq()? I don't see any linalg going on in kmeans.

Ralf

2014-03-18 5:29 GMT+08:00 Ralf Gommers <ralf.gommers@gmail.com>:
> Looks like ELKI has a mix of licenses including BSD, but the default is AGPL. Did you check for this algorithm? [...]

The SLINK Java implementation in ELKI should be under the AGPL, but these two algorithms were published decades ago and should not be patented.

> Are you talking about the for-loop over observations in vq()? I don't see any linalg going on in kmeans.

We can expand the formula when calculating the distances and then make use of some BLAS functions. I've just written a demo of it: https://gist.github.com/richardtsai/9614846 (written casually, a bit messy)

It runs about 16x faster than the original C version on a 100000x500 dataset with k = 30. (Built with a thread-enabled ATLAS.)

In [30]: X.shape
Out[30]: (100000, 500)

In [31]: c.shape
Out[31]: (30, 500)

In [32]: %timeit _vq.vq(X, c)
1 loops, best of 3: 1.28 s per loop

In [33]: %timeit _vq_rewrite.vq(X, c)
10 loops, best of 3: 79.6 ms per loop

Regards,
Richard
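The expansion referred to above is presumably the identity ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, which turns the cross term for all observation/centroid pairs into a single matrix product that NumPy hands to BLAS. The following is a minimal NumPy sketch of that idea, not the code from the gist; the function name `vq_expanded` is an assumption made for this example, and its return value simply mirrors the (code, dist) pair of scipy.cluster.vq.vq.

```python
import numpy as np


def vq_expanded(obs, code_book):
    """Nearest-centroid assignment via the expanded squared distance.

    Illustrative sketch only (hypothetical helper, not SciPy API).
    """
    obs = np.asarray(obs, dtype=np.float64)
    code_book = np.asarray(code_book, dtype=np.float64)

    # Squared norms of every observation and every centroid.
    obs_sq = np.einsum("ij,ij->i", obs, obs)
    code_sq = np.einsum("ij,ij->i", code_book, code_book)

    # The (n, k) cross term is one matrix product, dispatched to BLAS.
    cross = obs @ code_book.T

    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2 for all pairs at once.
    d2 = obs_sq[:, None] - 2.0 * cross + code_sq[None, :]
    # Rounding can push tiny distances slightly below zero; clip them.
    np.maximum(d2, 0.0, out=d2)

    codes = d2.argmin(axis=1)
    dist = np.sqrt(d2[np.arange(len(obs)), codes])
    return codes, dist
```

One trade-off worth keeping in mind: the expanded form subtracts large, nearly equal numbers when an observation lies very close to a centroid, so it is somewhat less accurate than computing the differences directly, and the full (n, k) distance matrix must fit in memory unless the observations are processed in chunks.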

On Tue, Mar 18, 2014 at 10:03 AM, Richard Tsai <richard9404@gmail.com> wrote:
> We can expand the formula when calculating the distances then make use of some BLAS functions. [...] It runs about 16x faster than the original C version in a 100000x500 dataset with k = 30. (Built with a thread-enabled ATLAS)

OK, clear. That's a pretty decent speed-up :)

Ralf

Hi all,

I've posted my proposal to melange, but there are still some potential features for the package (cluster) that I want to discuss here.

The first one is about the stopping criterion of kmeans/kmeans2. These two functions currently use the average distance from observations to their corresponding centroids. But a more accurate exit condition would be the average *squared* distance. Besides, the average distance moved by the centroids and the number of changes in the results of vq would both be better criteria than the current one.

Second, finding convex hulls of hierarchical clusterings seems interesting, but I'm not sure if there's a demand for it.

The third one is gap statistics for automatic determination of k in kmeans. David suggested that it is scikit-learn territory, and I plan to leave it to the end.

I'm not sure if these features are proper to integrate into cluster, and Ralf suspects that there's some overlap with scikit-learn, so I am posting them here for discussion at his suggestion. I've also made my proposal public: http://www.google-melange.com/gsoc/proposal/public/google/gsoc2014/richardts...

Comments/suggestions are welcome.

Regards,
Richard
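The three candidate stopping criteria mentioned above can be compared side by side with a small NumPy sketch. It runs plain Lloyd iterations, stops on the change in average squared distance, and also reports the average centroid movement and the number of label changes for the last iteration. This is an illustration only and does not reflect how kmeans/kmeans2 are implemented; the function name `kmeans_with_diagnostics`, its parameters and the keys of the returned dictionary are assumptions made for this example.

```python
import numpy as np


def kmeans_with_diagnostics(X, centroids, thresh=1e-5, max_iter=100):
    """Lloyd iterations with three convergence measures (sketch only)."""
    X = np.asarray(X, dtype=np.float64)
    centroids = np.array(centroids, dtype=np.float64)  # work on a copy
    labels = np.full(len(X), -1)
    prev_sq = np.inf

    for it in range(max_iter):
        # Assignment step: nearest centroid for every observation.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        # Criterion 1: average *squared* distance to the assigned centroid.
        mean_sq = d2[np.arange(len(X)), new_labels].mean()

        # Update step: move each non-empty centroid to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = X[new_labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)

        # Criterion 2: average distance moved by the centroids.
        shift = np.sqrt(((new_centroids - centroids) ** 2).sum(axis=1)).mean()
        # Criterion 3: number of observations whose label changed.
        changed = int((new_labels != labels).sum())

        centroids, labels = new_centroids, new_labels
        # Stop once the average squared distance no longer improves.
        if prev_sq - mean_sq <= thresh:
            break
        prev_sq = mean_sq

    info = {
        "mean_sq_dist": mean_sq,
        "centroid_shift": shift,
        "labels_changed": changed,
        "iterations": it + 1,
    }
    return centroids, labels, info
```

The average squared distance is the quantity Lloyd's algorithm actually decreases at every step, which is the argument for preferring it over the plain average distance as an exit condition.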

2014-03-21 22:18 GMT+08:00 Richard Tsai <richard9404@gmail.com>:
> I've posted my proposal to melange but there's still some potential features to the package (cluster) I want to discuss here. [...]

Hi all,

I've received emails from GSoC saying that my proposal has been accepted. Thanks to those who have helped me with my application!

I'll submit the required materials soon, then make a more detailed plan and prepare for coding. If you have any thoughts about my project, please discuss them with me!

Richard

On 4/22/14, Richard Tsai <richard9404@gmail.com> wrote:
> I've received emails from GSoC saying that my proposal has been accepted. [...]

Congratulations, Richard! That's great news.

Warren
participants (4)
- David Warde-Farley
- Ralf Gommers
- Richard Tsai
- Warren Weckesser