Hi Roman,

I reduced my original data set with feature selection; it now has n_samples=10467, n_features=23. I tried clustering with the Birch algorithm again, and this time it worked. I obtained 35 clusters for the reduced dataset in the attachment (data2.dat).

How can I know which cluster member best represents each cluster? For example, Cluster 0 has 5 members, which are rows 1, 2, 3, 28 and 29 of the data set. Which cluster member (1, 2, 3, 28 or 29) best represents Cluster 0?

In the Birch code I use this line:

*centroids = brc.subcluster_centers_*

How do I interpret the output of this line of code?

Thank you so much for your help.

*Birch Code:*

    from sklearn.cluster import Birch
    import numpy as np

    # Raw string so the backslash in the Windows path is not read as an escape
    X = np.loadtxt(r"C:\data2.dat", delimiter=",")

    brc = Birch(branching_factor=50, n_clusters=None, threshold=0.5,
                compute_labels=True, copy=True)
    brc.fit(X)

    centroids = brc.subcluster_centers_   # one row per leaf subcluster
    labels = brc.subcluster_labels_       # one label per leaf subcluster
    predicted = brc.predict(X)            # one label per row of X

    print("\n brc.predict(X)")
    print(predicted)
    print("\n centroids")
    print(centroids)
    print("\n labels")
    print(labels)

On Mon, Jul 3, 2017 at 11:46 PM, Roman Yurchak <rth.yurchak@gmail.com> wrote:
Hello Sema,
as far as I can tell, your dataset has n_samples=65909, n_features=539. Clustering high-dimensional data is problematic for a number of reasons, https://en.wikipedia.org/wiki/Clustering_high-dimensional_data#Problems
besides, the BIRCH implementation doesn't scale well for n_features >> 50 (see for instance the discussion in the second part of https://github.com/scikit-learn/scikit-learn/pull/8808#issuecomment-300776216 also in ).
As a workaround for the memory error, you could try using the out-of-core version of Birch (using `partial_fit` on chunks of the dataset, instead of `fit`), but in any case it might be better to reduce dimensionality beforehand (e.g. with PCA), if that's acceptable. Also, the threshold parameter may need to be increased, since in your dataset the Euclidean distances look to be more in the 1-10 range.
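A minimal sketch of that workaround, assuming synthetic random data in place of data.dat. The chunk size, the number of PCA components and the threshold value are illustrative guesses, not values tuned for the real dataset:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.decomposition import PCA

# Synthetic stand-in for the real data (assumption: the actual file is not used here)
rng = np.random.RandomState(0)
X = rng.rand(2000, 100)

# Reduce dimensionality first; n_components=20 is an arbitrary illustrative choice
X_reduced = PCA(n_components=20, random_state=0).fit_transform(X)

# Build the Birch tree incrementally, one chunk at a time, instead of calling fit once
brc = Birch(threshold=1.0, branching_factor=50, n_clusters=None)
chunk_size = 500  # hypothetical chunk size
for start in range(0, X_reduced.shape[0], chunk_size):
    brc.partial_fit(X_reduced[start:start + chunk_size])

# Assign every row to its nearest leaf subcluster
labels = brc.predict(X_reduced)
print(brc.subcluster_centers_.shape, labels.shape)
```

If even the reduced matrix does not fit in memory, the same loop could read each chunk from disk instead of slicing an in-memory array.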
-- Roman
On 03/07/17 17:09, Sema Atasever wrote:
Dear Roman,
When I try the code with the original data (*data.dat*) as you suggested, I get the following error: *Memory Error* (*error.png*). How can I overcome this problem? Thank you so much in advance. data.dat <https://drive.google.com/file/d/0B4rY6f4kvHeCYlpZOURKNnR0Q1k/view?usp=drive_web>
On Fri, Jun 30, 2017 at 5:42 PM, Roman Yurchak <rth.yurchak@gmail.com> wrote:
Hello Sema,
On 30/06/17 17:14, Sema Atasever wrote:
I want to cluster them using the Birch clustering algorithm. Does this method have a 'precomputed' option?
No, it doesn't (see http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html), so you would need to provide it with the original feature matrix (not the precomputed distance matrix). Since your dataset is fairly small, there is no reason to precompute it anyway.
I need to train an SVM on the centroids of the microclusters, so *how can I get the centroids of the microclusters?*
By "microclusters" do you mean sub-clusters? If you are interested in the leaf subclusters, see the Birch.subcluster_centers_ attribute.
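A minimal sketch of reading `subcluster_centers_`, assuming synthetic data and `n_clusters=None` (so that each label in `labels_` is a row index into `subcluster_centers_`). Picking the member closest to the centroid as the "representative" of a subcluster is one possible convention, used here as an assumption rather than something Birch provides:

```python
import numpy as np
from sklearn.cluster import Birch

# Synthetic stand-in data (assumption: not the dataset from this thread)
rng = np.random.RandomState(0)
X = rng.rand(300, 4)

brc = Birch(threshold=0.3, branching_factor=50, n_clusters=None).fit(X)

centers = brc.subcluster_centers_  # shape (n_subclusters, n_features)
labels = brc.labels_               # labels[i] is the subcluster assigned to row i

# For each subcluster that has members, find the member row closest to its centroid
representatives = {}
for k in np.unique(labels):
    members = np.flatnonzero(labels == k)
    dists = np.linalg.norm(X[members] - centers[k], axis=1)
    representatives[int(k)] = int(members[np.argmin(dists)])

print(centers.shape)
print(representatives)
```

Each row of `centers` is the mean of the points absorbed into that leaf subcluster, so it is generally not itself a row of `X`. The rows of `centers` could also be passed as the training points of a downstream model.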
Otherwise, if you want all the centroids in the hierarchy of subclusters, you can browse the hierarchical tree via the Birch.root_ attribute, then look at _CFSubcluster.centroid_ for each subcluster.
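A rough sketch of such a traversal, assuming synthetic data. It relies on the private attributes `subclusters_` (on each tree node) and `child_` (on each subcluster), which are implementation details of scikit-learn and may change between versions:

```python
import numpy as np
from sklearn.cluster import Birch

# Synthetic stand-in data (assumption: not the dataset from this thread)
rng = np.random.RandomState(0)
X = rng.rand(500, 5)

# A small branching factor makes the tree more likely to have several levels
brc = Birch(threshold=0.2, branching_factor=10, n_clusters=None).fit(X)

all_centroids = []   # centroids of every subcluster, at every level
leaf_centroids = []  # centroids of subclusters that have no child node
stack = [brc.root_]
while stack:
    node = stack.pop()
    for sub in node.subclusters_:      # private attribute (assumption)
        all_centroids.append(sub.centroid_)
        if sub.child_ is None:         # private attribute (assumption)
            leaf_centroids.append(sub.centroid_)
        else:
            stack.append(sub.child_)

all_centroids = np.vstack(all_centroids)
leaf_centroids = np.vstack(leaf_centroids)
print(all_centroids.shape, leaf_centroids.shape)
```

The leaf centroids collected this way should match the rows of `subcluster_centers_` up to ordering, while `all_centroids` also includes the coarser summaries stored in the internal nodes.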
Hope this helps,
-- Roman
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn