[scikit-learn] Construct the microclusters using a CF-Tree
Sema Atasever
s.atasever at gmail.com
Wed Jul 5 06:27:58 EDT 2017
Hi Roman,
I reduced my original data set with feature selection; it now has
n_samples=10467, n_features=23.
I tried clustering with the Birch algorithm again, and this time it worked.
I obtained 35 clusters for the reduced dataset in the attachment (data2.dat).
How can I know which cluster member best represents each cluster?
For example, Cluster 0 has 5 members, which are rows 1, 2, 3, 28 and 29 of
the data set.
Which cluster member (1, 2, 3, 28 or 29) best represents Cluster 0?
In the Birch code I use this line: *centroids =
brc.subcluster_centers_*
How do I interpret the output of this line of code?
Thank you so much for your help.
*Birch Code:*
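from sklearn.cluster import Birch
import numpy as np

# Raw string so the backslash in the Windows path is not read as an escape.
X = np.loadtxt(r"C:\data2.dat", delimiter=",")
brc = Birch(branching_factor=50, n_clusters=None,
            threshold=0.5, compute_labels=True, copy=True)
brc.fit(X)
# One centroid per leaf subcluster, shape (n_subclusters, n_features).
centroids = brc.subcluster_centers_
# Cluster label assigned to each subcluster centroid.
labels = brc.subcluster_labels_
# For each row of X, the label of its nearest subcluster centroid.
predicted = brc.predict(X)
print("\n brc.predict(X)")
print(predicted)
print("\n centroids")
print(centroids)
print("\n labels")
print(labels)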
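
One possible reading of "best represents" is the member that lies closest to its subcluster centroid. The sketch below assumes that reading and uses synthetic data in place of data2.dat. It also relies on n_clusters=None, in which case label k in brc.labels_ refers to row k of brc.subcluster_centers_. The names `members` and `representatives` are illustrative only.

```python
import numpy as np
from sklearn.cluster import Birch

# Synthetic stand-in for data2.dat: three tight blobs in 2-D (assumption).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2))
               for c in (0.0, 5.0, 10.0)])

brc = Birch(branching_factor=50, n_clusters=None, threshold=0.5)
brc.fit(X)

centroids = brc.subcluster_centers_  # one row per leaf subcluster
labels = brc.labels_                 # label of each row of X

# For every subcluster, keep the member row nearest to its centroid.
representatives = {}
for k, center in enumerate(centroids):
    members = np.flatnonzero(labels == k)
    if members.size == 0:
        continue  # no row was assigned to this centroid
    dists = np.linalg.norm(X[members] - center, axis=1)
    representatives[k] = int(members[np.argmin(dists)])

for k, row in representatives.items():
    print("cluster", k, "-> row", row)
```

The centroid itself is a mean and is usually not a row of the data set, which is why the nearest member is looked up separately. Other definitions of "representative" are possible and would need a different distance or criterion.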
On Mon, Jul 3, 2017 at 11:46 PM, Roman Yurchak <rth.yurchak at gmail.com>
wrote:
> Hello Sema,
>
> as far as I can tell, your dataset has n_samples=65909,
> n_features=539. Clustering high-dimensional data is problematic for a
> number of reasons; see
> https://en.wikipedia.org/wiki/Clustering_high-dimensional_data#Problems
>
> Besides, the BIRCH implementation doesn't scale well for n_features >> 50
> (see for instance the discussion in the second part of
> https://github.com/scikit-learn/scikit-learn/pull/8808#issuecomment-300776216
> also in ).
>
> As a workaround for the memory error, you could try using the out-of-core
> version of Birch (using `partial_fit` on chunks of the dataset, instead of
> `fit`) but in any case it might also be better to reduce dimensionality
> beforehand (e.g. with PCA), if that's acceptable. Also the threshold
> parameter may need to be increased: since in your dataset it looks like the
> Euclidean distances are more in the 1-10 range?
>
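
The out-of-core suggestion above could look roughly like the sketch below. It is only an illustration under several assumptions: the data is synthetic, the chunk size, number of components and threshold are arbitrary, and IncrementalPCA is used in place of plain PCA so that the dimensionality reduction can also be fitted chunk by chunk.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.decomposition import IncrementalPCA

# Synthetic stand-in for a large, high-dimensional data set (assumption).
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 60))

n_components = 10  # arbitrary target dimensionality
chunk_size = 200   # arbitrary; must be at least n_components


def iter_chunks(data, size):
    """Yield consecutive row blocks; in practice these could come from disk."""
    for start in range(0, data.shape[0], size):
        yield data[start:start + size]


# Pass 1: fit the dimensionality reduction incrementally.
ipca = IncrementalPCA(n_components=n_components)
for chunk in iter_chunks(X, chunk_size):
    ipca.partial_fit(chunk)

# Pass 2: grow the CF-tree chunk by chunk on the reduced features.
brc = Birch(threshold=3.0, n_clusters=None)
for chunk in iter_chunks(X, chunk_size):
    brc.partial_fit(ipca.transform(chunk))

# Labels can then be computed chunk by chunk as well.
labels = np.concatenate([brc.predict(ipca.transform(chunk))
                         for chunk in iter_chunks(X, chunk_size)])
print(brc.subcluster_centers_.shape, labels.shape)
```

The threshold has to be chosen on the scale of the reduced features, so the value here should not be carried over to real data.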
> --
> Roman
>
>
> On 03/07/17 17:09, Sema Atasever wrote:
>
>> Dear Roman,
>>
>> When I try the code with the original data (*data.dat*) as you
>> suggested, I get the following error: *Memory Error* --> (*error.png*).
>> How can I overcome this problem? Thank you so much in advance.
>>
>> data.dat
>> <https://drive.google.com/file/d/0B4rY6f4kvHeCYlpZOURKNnR0Q1k/view?usp=drive_web>
>>
>>
>> On Fri, Jun 30, 2017 at 5:42 PM, Roman Yurchak <rth.yurchak at gmail.com> wrote:
>>
>> Hello Sema,
>>
>> On 30/06/17 17:14, Sema Atasever wrote:
>>
>> I want to cluster them using the Birch clustering algorithm.
>> Does this method have a 'precomputed' option?
>>
>>
>> No, it doesn't; see
>> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html
>> so you would need to provide it with the original feature matrix
>> (not the precomputed distance matrix). Since your dataset is fairly
>> small, there is no reason to precompute it anyway.
>>
>> I need to train an SVM on the centroids of the microclusters, so
>> *how can I get the centroids of the microclusters?*
>>
>>
>> By "microclusters" do you mean sub-clusters? If you are interested
>> in the leaf subclusters, see the Birch.subcluster_centers_ attribute.
>>
>> Otherwise if you want all the centroids in the hierarchy of
>> subclusters, you can browse the hierarchical tree via the
>> Birch.root_ attribute then look at _CFSubcluster.centroid_ for each
>> subcluster.
>>
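
Walking the hierarchy described above might look like the following sketch. It assumes the private layout of the current scikit-learn implementation, where each node exposes a `subclusters_` list and each subcluster has `centroid_` and `child_` attributes; these are internal names that may change between versions. The data and the helper `collect_centroids` are made up for illustration.

```python
import numpy as np
from sklearn.cluster import Birch

# Synthetic data; a small branching factor forces a multi-level tree.
rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 2))

brc = Birch(branching_factor=5, threshold=0.05, n_clusters=None).fit(X)


def collect_centroids(node, depth=0, out=None):
    """Recursively gather (depth, is_leaf_subcluster, centroid) entries.

    Relies on private attributes (subclusters_, child_, centroid_).
    """
    if out is None:
        out = []
    for sub in node.subclusters_:
        is_leaf = sub.child_ is None
        out.append((depth, is_leaf, np.array(sub.centroid_)))
        if not is_leaf:
            collect_centroids(sub.child_, depth + 1, out)
    return out


all_entries = collect_centroids(brc.root_)
leaf_centroids = [c for _, is_leaf, c in all_entries if is_leaf]

print("subclusters at all levels:", len(all_entries))
print("leaf subclusters:", len(leaf_centroids))
```

The leaf entries found this way should correspond to the rows of `brc.subcluster_centers_`, which is the public route to the same centroids and the safer choice when only the leaves are needed.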
>> Hope this helps,
>> --
>> Roman
-------------- next part --------------
A non-text attachment was scrubbed...
Name: screen_shot.png
Type: image/png
Size: 103493 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20170705/047fd1ed/attachment-0001.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: data2.dat
Type: application/octet-stream
Size: 18776 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/scikit-learn/attachments/20170705/047fd1ed/attachment-0001.obj>