Construct the microclusters using a CF-Tree
Hi all,

I want to ask you about clustering using the Birch clustering algorithm. I have an n*n *distance matrix* M, where M_ij is the distance between object_i and object_j (you can see the file format in the attachment). I want to cluster the objects using the Birch clustering algorithm. Does this method have a 'precomputed' option?

I need to train an SVM on the centroids of the microclusters, so *how can I get the centroids of the microclusters?*

*Birch code:*

    from sklearn.cluster import Birch
    import numpy as np

    # Raw string so that the backslash in the Windows path is kept literally
    X = np.loadtxt(r"C:\dm.txt", delimiter="\t")
    brc = Birch(branching_factor=50, n_clusters=3, threshold=0.5, compute_labels=True, copy=True)
    brc.fit(X)
    print(brc.predict(X))

Any help would be highly appreciated.
Hello Sema,

On 30/06/17 17:14, Sema Atasever wrote:
I want to cluster them using the Birch clustering algorithm. Does this method have a 'precomputed' option?

No, it doesn't (see http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html), so you would need to provide it with the original feature matrix, not the precomputed distance matrix. Since your dataset is fairly small, there is no reason to precompute the distances anyway.

I need to train an SVM on the centroids of the microclusters, so *how can I get the centroids of the microclusters?*

By "microclusters" do you mean sub-clusters? If you are interested in the leaf subclusters, see the Birch.subcluster_centers_ attribute. Otherwise, if you want all the centroids in the hierarchy of subclusters, you can browse the hierarchical tree via the Birch.root_ attribute, then look at _CFSubcluster.centroid_ for each subcluster.

Hope this helps,
-- Roman
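For reference, a minimal sketch of the two options mentioned in this reply, using a synthetic feature matrix in place of the real data (an assumption). The helper name `iter_subcluster_centroids` is hypothetical, and it relies on private attributes (`subclusters_`, `child_`, `centroid_`) of scikit-learn's internal CF-tree classes, which are not part of the public API and may change between releases:

```python
import numpy as np
from sklearn.cluster import Birch


def iter_subcluster_centroids(node, depth=0):
    """Yield (depth, centroid) for every subcluster below a CF-tree node.

    Uses private attributes of scikit-learn's internal node and
    subcluster classes; treat this as a sketch, not a stable API.
    """
    for subcluster in node.subclusters_:
        yield depth, subcluster.centroid_
        if subcluster.child_ is not None:  # internal subcluster: descend
            yield from iter_subcluster_centroids(subcluster.child_, depth + 1)


# Synthetic stand-in for the real feature matrix (assumption).
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 5))

brc = Birch(branching_factor=10, threshold=0.5, n_clusters=None).fit(X)

# Public attribute: centroids of the leaf subclusters,
# shape (n_leaf_subclusters, n_features).
leaf_centroids = brc.subcluster_centers_

# Private tree walk: centroids at every level of the hierarchy.
all_centroids = list(iter_subcluster_centroids(brc.root_))

print(leaf_centroids.shape, len(all_centroids))
```

The rows of `leaf_centroids` could then serve as the training points for an SVM, which is what the original question was after.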
Dear Roman,

When I try the code with the original data (*data.dat*) as you suggested, I get the following error: *Memory Error* (see *error.png*). How can I overcome this problem? Thank you so much in advance.

data.dat <https://drive.google.com/file/d/0B4rY6f4kvHeCYlpZOURKNnR0Q1k/view?usp=drive_...>

On Fri, Jun 30, 2017 at 5:42 PM, Roman Yurchak <rth.yurchak@gmail.com> wrote:
[...]
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
Hello Sema,

as far as I can tell, your dataset has n_samples=65909 and n_features=539. Clustering high-dimensional data is problematic for a number of reasons (see https://en.wikipedia.org/wiki/Clustering_high-dimensional_data#Problems). Besides, the BIRCH implementation doesn't scale well for n_features >> 50 (see for instance the discussion in the second part of https://github.com/scikit-learn/scikit-learn/pull/8808#issuecomment-30077621...).

As a workaround for the memory error, you could try using the out-of-core version of Birch (calling `partial_fit` on chunks of the dataset instead of `fit`), but in any case it might be better to reduce the dimensionality beforehand (e.g. with PCA), if that's acceptable. The threshold parameter may also need to be increased, since the Euclidean distances in your dataset look to be more in the 1-10 range.

-- Roman

On 03/07/17 17:09, Sema Atasever wrote:
[...]
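A rough sketch of the workaround described in the reply above (dimensionality reduction followed by `partial_fit` on chunks). The data is synthetic, and the number of PCA components (20), the threshold (2.0) and the number of chunks (10) are illustrative guesses, not values taken from the thread:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.decomposition import PCA

# Synthetic stand-in for the real data (assumption); the dataset discussed
# in the thread has n_samples=65909 and n_features=539.
rng = np.random.RandomState(0)
X = rng.normal(size=(2000, 100))

# Reduce dimensionality first; 20 components is an arbitrary choice.
X_reduced = PCA(n_components=20, random_state=0).fit_transform(X)

# Threshold raised above the 0.5 used earlier; 2.0 is a guess and should
# be tuned to the scale of the distances in the data.
brc = Birch(branching_factor=50, threshold=2.0, n_clusters=None)

# Build the CF-tree chunk by chunk instead of calling fit on everything.
for chunk in np.array_split(X_reduced, 10):
    brc.partial_fit(chunk)

labels = brc.predict(X_reduced)
print(brc.subcluster_centers_.shape, labels.shape)
```

If the full matrix does not fit in memory at all, the chunks would have to be read from disk one at a time and the PCA step replaced by something incremental; that is outside what this sketch covers.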
Hi Roman,

I reduced my original data set with feature selection; it now has n_samples=10467 and n_features=23. I tried clustering with the Birch algorithm again, and this time it worked. I obtained 35 clusters for the reduced dataset in the attachment (data2.dat).

How can I know which cluster member best represents each cluster? For example, Cluster 0 has 5 members, which are rows 1, 2, 3, 28 and 29 of the data set. Which cluster member (1, 2, 3, 28 or 29) best represents Cluster 0?

In the Birch code I use this line: *centroids = brc.subcluster_centers_* How do I interpret the output of this line?

Thank you so much for your help.

*Birch Code:*

    from sklearn.cluster import Birch
    import numpy as np

    # Raw string so that the backslash in the Windows path is kept literally
    X = np.loadtxt(r"C:\data2.dat", delimiter=",")
    brc = Birch(branching_factor=50, n_clusters=None, threshold=0.5, compute_labels=True, copy=True)
    brc.fit(X)

    centroids = brc.subcluster_centers_  # one row per leaf subcluster
    labels = brc.subcluster_labels_      # one label per leaf subcluster

    print("\n brc.predict(X)")
    print(brc.predict(X))
    print("\n centroids")
    print(centroids)
    print("\n labels")
    print(labels)

On Mon, Jul 3, 2017 at 11:46 PM, Roman Yurchak <rth.yurchak@gmail.com> wrote:
[...]
Hello Sema,

On 05/07/17 13:27, Sema Atasever wrote:
How can I know which cluster member best represents each cluster?

You could try to pick the one that's closest to the cluster centroid.
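A small sketch of that idea, assuming Euclidean distance and a centroid recomputed as the mean of the cluster's members. The function name `closest_member` and the toy data are illustrative only:

```python
import numpy as np


def closest_member(X, labels, k):
    """Return the row index in X of the member of cluster k nearest its centroid."""
    members = np.flatnonzero(labels == k)   # row indices belonging to cluster k
    centroid = X[members].mean(axis=0)      # centroid recomputed from the members
    distances = np.linalg.norm(X[members] - centroid, axis=1)
    return members[np.argmin(distances)]


# Toy data (assumption): two clusters in 2-D.
X = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.1], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 0, 1, 1])

print(closest_member(X, labels, 0))  # prints 2
```

With a fitted model, `labels` would be `brc.labels_` (or the output of `brc.predict(X)`), and the returned index points back into the rows of the data set.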
In the Birch code I use this line: *centroids = brc.subcluster_centers_* How do I interpret the output of this line?

It is supposed to give you the centroid of each leaf subcluster (computed in https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/cluster/bi...). I would just recompute the centroids from the labels, though, with X[brc.labels_ == k, :].mean(axis=0) for k in np.unique(brc.labels_), to be sure of the results.

-- Roman
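The recomputation suggested above could look roughly like this. The helper name `centroids_from_labels` is hypothetical and the toy arrays stand in for `X` and `brc.labels_`. The `axis=0` argument matters: without it, `mean()` collapses each cluster to a single scalar instead of one value per feature.

```python
import numpy as np


def centroids_from_labels(X, labels):
    """Return (unique_labels, centroids), with one centroid row per label."""
    unique_labels = np.unique(labels)
    # axis=0 averages over the rows of each cluster, keeping one value per feature
    centroids = np.vstack([X[labels == k].mean(axis=0) for k in unique_labels])
    return unique_labels, centroids


# Toy data (assumption) standing in for X and brc.labels_.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]])
labels = np.array([0, 0, 1, 1])

unique_labels, centroids = centroids_from_labels(X, labels)
print(centroids)
```

Centroids recomputed this way need not match `brc.subcluster_centers_` exactly, since `labels_` assigns each sample to its nearest centroid after the tree has been built.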
Participants (2)
- Roman Yurchak
- Sema Atasever