[scikit-learn] Large computation time for homogeneous data with agglomerative clustering

Md. Khairullah md.khairullah at gmail.com
Mon Sep 26 15:43:05 EDT 2016


Dear Scikit-learners,
This is my first post here and I hope you experts can help me a lot.

We are using agglomerative clustering with Ward's linkage and a
connectivity constraint. The data set contains around 205,000 samples,
each a single scalar feature. The data set is dynamic in time, and we
need to apply the clustering at different points throughout the
process. Initially all values are 0 and they increase gradually; in
other words, at the early stages the data are more homogeneous and the
heterogeneity increases gradually. If the clustering is applied at the
final stage (the most heterogeneous data, which of course still
contains patterns/clusters), requesting 20 clusters takes only 61 s of
CPU time. But if the clustering is run at an early stage (more
homogeneous data, though not all zero, and of course also containing
patterns/clusters) with the same settings, the time rises to 1 h 5 m.
The CPU time falls between these two figures when the data come from an
intermediate time stamp. I also tried the other linkage options, but
the situation does not improve. My suspicion is that the homogeneity of
the data is playing a role.
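
For clarity, here is roughly how we call it. This is only a sketch: X
stands in for our real data, and the chain-style connectivity matrix
below is a simplified placeholder for the neighbourhood graph we
actually build.

import numpy as np
from scipy.sparse import diags
from sklearn.cluster import AgglomerativeClustering

# ~205,000 samples, each a single scalar feature
n_samples = 205000
X = np.random.rand(n_samples, 1)   # placeholder for the real data

# Simplified stand-in for our connectivity constraint: each sample is
# connected to its immediate neighbours along the sequence
connectivity = diags(
    [np.ones(n_samples - 1), np.ones(n_samples - 1)],
    offsets=[-1, 1],
    format="csr",
)

model = AgglomerativeClustering(
    n_clusters=20,
    linkage="ward",
    connectivity=connectivity,
)
labels = model.fit_predict(X)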

Have you experienced this too? What solution do you suggest?

Thanks in advance for your attention and help.

-- 
Best regards

Md. Khairullah
PhD Student, KU Leuven
Numerical Analysis and Applied Mathematics Section
Celestijnenlaan 200a - box 2402
3001 Leuven
room: 03.18
tel. +32 16 37 39 66
fax +32 16 3 27996

