Hi David,

Thanks for your advice! I'll improve my proposal and pay more attention to documentation. I agree that the vq module should be kept simple but high-performance, so I'll focus on optimizing it. I'll also read up on hierarchical clustering soon and look for potential improvements to the hierarchy module.

Regards,

Richard

2014-03-14 15:32 GMT+08:00 David Warde-Farley <d.warde.farley@gmail.com>:

Hi,

FWIW, I think this is a pretty good proposal, but I worry that some of
it duplicates work that's already taken place in scikit-learn.

I think that a high-performance vq module is an important thing to
have in SciPy itself (though Jake Vanderplas did some work on distance
computations in Cython for scikit-learn that should be leveraged if
possible, maybe Jake has thoughts on factoring it into a separate
package?) and to my knowledge, the hierarchy module is not duplicated
to a great extent in scikit-learn. I'd thus prioritize those two
things, *including* sprucing up their documentation (SciPy is a fairly
mature project, and one where documentation is, ideally, not an
afterthought).

Things like mini-batch k-means and automatic determination of k are
interesting but more scikit-learn territory. I would leave these
things to the end, on an if-there's-time basis.
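
For anyone unfamiliar with it, the mini-batch k-means idea mentioned above
can be sketched roughly as follows. This is a pure-Python illustration only,
not SciPy or scikit-learn code: the names nearest and minibatch_kmeans and
the parameters batch_size, n_iter and seed are made up for the example, and
the per-center learning rate of 1/count is an assumption based on the
"Web-Scale K-Means Clustering" paper cited in the proposal below.

```python
import random


def nearest(centers, x):
    """Return the index of the center closest to x (squared Euclidean)."""
    best, best_d = 0, float("inf")
    for j, c in enumerate(centers):
        d = sum((a - b) ** 2 for a, b in zip(x, c))
        if d < best_d:
            best, best_d = j, d
    return best


def minibatch_kmeans(data, k, batch_size=10, n_iter=100, seed=0):
    """Hypothetical sketch: update k centers from small random batches."""
    rng = random.Random(seed)
    # Initialise the centers with k distinct observations.
    centers = [list(p) for p in rng.sample(data, k)]
    counts = [0] * k  # how many points each center has absorbed so far
    for _ in range(n_iter):
        batch = [rng.choice(data) for _ in range(batch_size)]
        # Assign the whole batch before moving any center.
        labels = [nearest(centers, x) for x in batch]
        for x, j in zip(batch, labels):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            centers[j] = [(1 - eta) * c + eta * xi
                          for c, xi in zip(centers[j], x)]
    return centers
```

Each iteration touches only batch_size observations rather than the whole
dataset, which is where the savings on large inputs would come from.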

Since that _vq_rewrite was written, Cython has introduced much cleaner
memoryviews. Definitely prefer those over the deprecated ndarray
syntax.

On Thu, Mar 13, 2014 at 11:08 AM, Richard Tsai <richard9404@gmail.com> wrote:
> Hi all,
> I wrote a draft proposal for my GSoC project on the cluster package, and I'm
> posting it to the list in the hope of getting advice. However, as Ralf said,
> cluster is not well maintained now, and I still haven't been able to find
> someone who knows about cluster analysis to mentor me. If you have any
> suggestions for my proposal, or are willing to mentor me, please let me know;
> I would be really grateful.
>
> Regards,
> Richard
>
> Proposal Title: SciPy: Rewrite and improve cluster package in Cython
>
> Proposal Abstract
>
> According to the roadmap to SciPy 1.0, the cluster package needs a Cython
> rewrite to make it more maintainable and efficient. Besides, there's room
> for improvement in the cluster.vq module: some useful features can be added,
> and performance on large datasets can be improved.
>
> Proposal Detailed Description/Timeline
>
> There's an experimental Cython implementation of the vq module in the source
> tree. However, it has not been maintained for about 2 years, it only
> supports single precision datasets, and it is slower than the original
> implementation.
>
> I plan to start with some cleanup, then finish the double precision
> support. After some optimization and tuning it should be mature enough to
> replace the original implementation.
>
> After that, I'm going to implement a mini-batch optimization for the
> kmeans/kmeans2 functions based on a paper ("Web-Scale K-Means Clustering"),
> which should greatly improve performance on large datasets. In
> addition, I think support for automatically determining the number of
> clusters via some method (e.g. the gap statistic) can be included in this
> module.
>
> As for the hierarchy module, it is rather full-featured now, but the Cython
> rewrite has not yet begun. I'll rewrite the high-level part in Cython first,
> since it is convenient to call the original underlying C functions from
> Cython code. After that, I'll gradually migrate the underlying part from C
> to Cython.
>
> My detailed timeline is as follows.
>
> Week 1: Clean up the existing experimental Cython version of vq (bugs, docs,
> etc.), and add unit tests and performance benchmarks for datasets of
> various sizes and distributions.
> Week 2: Finish the double precision support in the Cython version of vq, and
> try migrating some Python code to Cython to improve performance.
> Week 3: Do some performance profiling, continue to optimize the performance
> of vq, try to replace the original C implementation with the new Cython
> implementation.
> Week 4: Implement the mini-batch K-means algorithm.
> Week 5: Add support for automatically determining the number of clusters.
> Week 6: Maneuver time. Finish the work that is behind schedule, and try some
> potential optimizations.
> Week 7: Build a framework for the Cython implementation of the hierarchy
> module. The work should just be translating the wrapper functions in
> hierarchy_wrap.c into Cython, so there may be no performance gains at that
> point.
> Week 8-9: Rewrite the underlying implementation of the hierarchy module in
> Cython. The major work is to translate hierarchy.c into Cython.
> Week 10: Optimize the Cython implementation of the hierarchy module, replace
> the original implementation if possible.
> Remaining time (if any): Improve the documentation and add some sample code,
> especially for the hierarchy module.
>
> Code Sample
>
> My previous patches to SciPy can be found in
> https://github.com/scipy/scipy/pulls/richardtsai?state=closed
> I haven't submitted code to the cluster package but I'll probably make a
> related PR soon.
>
>

> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>