<div dir="ltr">Hi Roman,<div><br></div><div>I reduced my original data set with feature selection; it now has n_samples=10467, n_features=23.<br></div><div><br></div><div>I tried clustering with the Birch algorithm, and this time it worked.</div><div>I obtained 35 clusters for the reduced dataset in the attachment (data2.dat).</div><div><br></div><div>How can I know which cluster member best represents each cluster?</div><div><br></div><div>For example, Cluster 0 has 5 members, which are rows 1, 2, 3, 28 and 29 of the data set.</div><div><br></div><div>Which cluster member (1, 2, 3, 28 or 29) best represents Cluster 0?</div><div><br></div><div>In the Birch code I use this line: <b>centroids = brc.subcluster_centers_</b></div><div><b><br></b></div><div>How do I interpret the output of this line?<br></div><div><br></div><div>Thank you so much for your help.</div><div><br></div><div><b>Birch Code:</b></div><div><div>from sklearn.cluster import Birch</div><div>import numpy as np</div><div><br></div><div># Raw string so the backslash in the Windows path is not treated as an escape</div><div>X = np.loadtxt(r"C:\data2.dat", delimiter=",")</div><div><br></div><div>brc = Birch(branching_factor=50, n_clusters=None, threshold=0.5, compute_labels=True, copy=True)</div><div><br></div><div>brc.fit(X)</div><div><br></div><div># One centroid and one label per leaf subcluster (not per sample)</div><div>centroids = brc.subcluster_centers_</div><div>labels = brc.subcluster_labels_</div><div><br></div><div># Cluster index assigned to each row of X</div><div>print("\n brc.predict(X)")</div><div>print(brc.predict(X))</div><div><br></div><div>print("\n centroids")</div><div>print(centroids)</div><div><br></div><div>print("\n labels")</div><div>print(labels)</div></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Jul 3, 2017 at 11:46 PM, Roman Yurchak <span dir="ltr"><<a href="mailto:rth.yurchak@gmail.com" target="_blank">rth.yurchak@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello Sema,<br>
<br>
As far as I can tell, your dataset has n_samples=65909, n_features=539. Clustering high-dimensional data is problematic for a number of reasons; see <a href="https://en.wikipedia.org/wiki/Clustering_high-dimensional_data#Problems" rel="noreferrer" target="_blank">https://en.wikipedia.org/wiki/<wbr>Clustering_high-dimensional_da<wbr>ta#Problems</a><br>
<br>
Besides, the BIRCH implementation doesn't scale well for n_features >> 50 (see for instance the discussion in the second part of <a href="https://github.com/scikit-learn/scikit-learn/pull/8808#issuecomment-300776216" rel="noreferrer" target="_blank">https://github.com/scikit-lear<wbr>n/scikit-learn/pull/8808#issue<wbr>comment-300776216</a> also in ).<br>
<br>
As a workaround for the memory error, you could try using the out-of-core version of Birch (calling `partial_fit` on chunks of the dataset instead of `fit`), but in any case it might also be better to reduce the dimensionality beforehand (e.g. with PCA), if that's acceptable. Also, the threshold parameter may need to be increased, since in your dataset it looks like the Euclidean distances are more in the 1-10 range.<br>
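<br>
A minimal sketch of this workaround, assuming synthetic random data in place of data.dat; the array shape, n_components=20, the chunk count and threshold=1.0 are illustrative assumptions, not values tuned for the real dataset. Note that PCA here still sees the whole array at once; if that is itself too large, sklearn.decomposition.IncrementalPCA may be an alternative.<br>
<pre>
```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.decomposition import PCA

# Synthetic stand-in for the real feature matrix (assumption for illustration)
rng = np.random.RandomState(0)
X = rng.rand(2000, 100)

# Reduce dimensionality before clustering; 20 components is an arbitrary choice
X_reduced = PCA(n_components=20, random_state=0).fit_transform(X)

# Larger threshold than the 0.5 default; the value 1.0 is only a guess
brc = Birch(branching_factor=50, n_clusters=None, threshold=1.0)

# Out-of-core style fitting: feed the data to Birch one chunk at a time
for chunk in np.array_split(X_reduced, 10):
    brc.partial_fit(chunk)

# Assign every sample to its nearest leaf subcluster
labels = brc.predict(X_reduced)
print(brc.subcluster_centers_.shape, labels.shape)
```
</pre>
With real data, each chunk could instead be read from disk in turn so that the full matrix never has to be handed to Birch at once.<br>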
<br>
-- <br>
Roman<span class=""><br>
<br>
<br>
On 03/07/17 17:09, Sema Atasever wrote:<br>
</span><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Dear Roman,<br>
<br>
When I try the code with the original data (*data.dat*) as you<br>
suggested, I get the following error: *Memory Error* --> (*error.png*).<span class=""><br>
How can I overcome this problem? Thank you so much in advance.<br>
<br></span>
data.dat<br>
<<a href="https://drive.google.com/file/d/0B4rY6f4kvHeCYlpZOURKNnR0Q1k/view?usp=drive_web" rel="noreferrer" target="_blank">https://drive.google.com/file<wbr>/d/0B4rY6f4kvHeCYlpZOURKNnR0Q1<wbr>k/view?usp=drive_web</a>><span class=""><br>
<br>
<br>
On Fri, Jun 30, 2017 at 5:42 PM, Roman Yurchak <<a href="mailto:rth.yurchak@gmail.com" target="_blank">rth.yurchak@gmail.com</a><br></span><div><div class="h5">
<mailto:<a href="mailto:rth.yurchak@gmail.com" target="_blank">rth.yurchak@gmail.com</a>><wbr>> wrote:<br>
<br>
Hello Sema,<br>
<br>
On 30/06/17 17:14, Sema Atasever wrote:<br>
<br>
I want to cluster them using Birch clustering algorithm.<br>
Does this method have a 'precomputed' option?<br>
<br>
<br>
No it doesn't, see<br>
<a href="http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html" rel="noreferrer" target="_blank">http://scikit-learn.org/stable<wbr>/modules/generated/sklearn.<wbr>cluster.Birch.html</a><br>
so you would need to provide it with the original feature matrix<br>
(not the precomputed distance matrix). Since your dataset is fairly<br>
small, there is no reason to precompute it anyway.<br>
<br>
I need to train an SVM on the centroids of the microclusters, so<br>
*how can I get the centroids of the microclusters?*<br>
<br>
<br>
By "microclusters" do you mean sub-clusters? If you are interested<br>
in the leaf subclusters, see the Birch.subcluster_centers_ attribute.<br>
<br>
Otherwise if you want all the centroids in the hierarchy of<br>
subclusters, you can browse the hierarchical tree via the<br>
Birch.root_ attribute then look at _CFSubcluster.centroid_ for each<br>
subcluster.<br>
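<br>
A rough sketch of such a traversal, assuming synthetic data and assuming that the private attributes subclusters_ (on tree nodes) and child_ and centroid_ (on subclusters) are laid out as in the scikit-learn source; these are implementation details and may change between versions. The helper iter_centroids is a hypothetical name introduced here.<br>
<pre>
```python
import numpy as np
from sklearn.cluster import Birch

# Synthetic stand-in data (assumption for illustration)
rng = np.random.RandomState(0)
X = rng.rand(500, 5)

# Small branching factor so the tree is likely to have more than one level
brc = Birch(branching_factor=5, n_clusters=None, threshold=0.2).fit(X)


def iter_centroids(node):
    """Yield the centroid of every subcluster at or below `node`, depth first."""
    for subcluster in node.subclusters_:
        yield subcluster.centroid_
        # child_ is None for subclusters that live in a leaf node
        if subcluster.child_ is not None:
            yield from iter_centroids(subcluster.child_)


# Centroids from every level of the hierarchy (relies on private attributes)
all_centroids = np.array(list(iter_centroids(brc.root_)))

# Centroids of the leaf subclusters only (public attribute)
leaf_centroids = brc.subcluster_centers_

print(all_centroids.shape, leaf_centroids.shape)
```
</pre>
For training an SVM on the leaf centroids only, the public subcluster_centers_ attribute is probably the safer choice, since it does not depend on private internals.<br>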
<br>
Hope this helps,<br>
--<br>
Roman<br>
______________________________<wbr>_________________<br>
scikit-learn mailing list<br></div></div>
<a href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.org</a> <mailto:<a href="mailto:scikit-learn@python.org" target="_blank">scikit-learn@python.or<wbr>g</a>><br>
<a href="https://mail.python.org/mailman/listinfo/scikit-learn" rel="noreferrer" target="_blank">https://mail.python.org/mailma<wbr>n/listinfo/scikit-learn</a><br>
<br>
<br>
</blockquote>
<br>
</blockquote></div><br></div>