[scikit-learn] Construct the microclusters using a CF-Tree

Roman Yurchak rth.yurchak at gmail.com
Mon Jul 3 16:46:03 EDT 2017


Hello Sema,

as far as I can tell, your dataset has n_samples=65909, 
n_features=539. Clustering high dimensional data is problematic for a 
number of reasons, 
https://en.wikipedia.org/wiki/Clustering_high-dimensional_data#Problems

Besides, the BIRCH implementation doesn't scale well for n_features >> 50 
(see for instance the discussion in the second part of 
https://github.com/scikit-learn/scikit-learn/pull/8808#issuecomment-300776216 
also in ).

As a workaround for the memory error, you could try the out-of-core 
version of Birch (calling `partial_fit` on chunks of the dataset instead 
of `fit`). In any case, it would probably be better to reduce the 
dimensionality beforehand (e.g. with PCA), if that's acceptable. The 
`threshold` parameter (default 0.5) may also need to be increased, since 
the Euclidean distances in your dataset look to be mostly in the 1-10 
range.
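
A minimal sketch of the workaround above, in case it helps. The data here
is a random stand-in (not your data.dat), and the chunk size, number of
components and threshold are illustrative guesses that would need tuning.
I used IncrementalPCA rather than plain PCA so that the dimensionality
reduction can also be done chunk by chunk; that choice is my assumption,
not a requirement.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.decomposition import IncrementalPCA

# Random stand-in for the real feature matrix (the real one is 65909 x 539).
rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 100))

chunk_size = 500    # illustrative value, pick what fits in memory
n_components = 20   # illustrative value, must be <= chunk_size

# First pass: fit the PCA projection incrementally, one chunk at a time.
ipca = IncrementalPCA(n_components=n_components)
for start in range(0, X.shape[0], chunk_size):
    ipca.partial_fit(X[start:start + chunk_size])

# Second pass: build the CF-tree out-of-core on the reduced chunks.
# n_clusters=None skips the final global clustering step, so only the
# leaf subclusters are computed.
birch = Birch(threshold=4.0, n_clusters=None)
for start in range(0, X.shape[0], chunk_size):
    birch.partial_fit(ipca.transform(X[start:start + chunk_size]))

# Centroids of the leaf subclusters, one row per subcluster.
centers = birch.subcluster_centers_
print(centers.shape)
```

The rows of `centers` live in the reduced (PCA) space, so anything trained
on them, such as an SVM, would need new samples passed through
`ipca.transform` first.
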

-- 
Roman


On 03/07/17 17:09, Sema Atasever wrote:
> Dear Roman,
>
> When I try the code with the original data (*data.dat*) as you
> suggested, I get the following error: *Memory Error* --> (*error.png*).
> How can I overcome this problem? Thank you so much in advance.
>>  data.dat
> <https://drive.google.com/file/d/0B4rY6f4kvHeCYlpZOURKNnR0Q1k/view?usp=drive_web>
>>
> On Fri, Jun 30, 2017 at 5:42 PM, Roman Yurchak
> <rth.yurchak at gmail.com> wrote:
>
>     Hello Sema,
>
>     On 30/06/17 17:14, Sema Atasever wrote:
>
>         I want to cluster them using the Birch clustering algorithm.
>         Does this method have a 'precomputed' option?
>
>
>     No it doesn't, see
>     http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html
>     so you would need to provide it with the original features matrix
>     (not the precomputed distance matrix). Since your dataset is fairly
>     small, there is no reason to precompute it anyway.
>
>         I need to train an SVM on the centroids of the microclusters, so
>         *how can I get the centroids of the microclusters?*
>
>
>     By "microclusters" do you mean sub-clusters? If you are interested
>     in the leaf subclusters see the Birch.subcluster_centers_ attribute.
>
>     Otherwise if you want all the centroids in the hierarchy of
>     subclusters, you can browse the hierarchical tree via the
>     Birch.root_ attribute then look at _CFSubcluster.centroid_ for each
>     subcluster.
>
>     Hope this helps,
>     --
>     Roman
>
>


