Issues with kmeans: Difference in centroid values
Hi everyone,

I was using scikit-learn KMeans algorithm to cluster pretrained word-vectors. There are a few things which I found to be surprising and wanted to get some feedback on.

- Based upon the 'labels_' assigned to each word-vector (i.e. cluster memberships), I compute every cluster centroid as the average of the word-vectors (corresponding to that cluster). Surprisingly, this seems to be pretty different from the 'cluster_centers_'. Is there anything that I am missing here?

- I was later using the verbose option to see if the clustering has converged or not. I saw on the console log messages such as "center shift 7.994126e-04 within tolerance 1.243425e-06". It seems that this corresponds to some code in *kmeans_elkan.pyx* (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/_k_...).

- Lastly, another thing that seems strange is that I hadn't set the tolerance value. So the default of 1e-4 should have been used. But if you look again at the above log, it says within tolerance 1.243425e-06 instead of 1e-4.

It would be great if you can look into this and help me out. Thank you so much! :)

Best,
Sidak Pal Singh
EPFL
On 04/16/2018 04:07 PM, Sidak Pal Singh wrote:
> Hi everyone,
>
> I was using scikit-learn KMeans algorithm to cluster pretrained word-vectors. There are a few things which I found to be surprising and wanted to get some feedback on.
>
> - Based upon the 'labels_' assigned to each word-vector (i.e. cluster memberships), I compute every cluster centroid as the average of the word-vectors (corresponding to that cluster). Surprisingly, this seems to be pretty different from the 'cluster_centers_'. Is there anything that I am missing here?

If the algorithm did not fully converge, you just did one more step, so the results are expected to be different.
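
To make the "one more step" point concrete, here is a minimal NumPy-only sketch of Lloyd-style k-means. It is not scikit-learn's implementation; the helper names (`assign_labels`, `centroids_from_labels`, `recompute_gap`) and the random data are hypothetical and only for illustration. It stops after a single iteration, averages the points per label, and compares the result with the centers that produced those labels. It then repeats the comparison at a fixed point.

```python
import numpy as np

def assign_labels(X, centers):
    # Assignment step: index of the nearest center for every row of X.
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)

def centroids_from_labels(X, labels, centers):
    # Update step: mean of the members of each cluster.
    # An empty cluster keeps its previous center (a simplifying assumption).
    new_centers = centers.copy()
    for k in range(len(centers)):
        members = X[labels == k]
        if len(members) > 0:
            new_centers[k] = members.mean(axis=0)
    return new_centers

def recompute_gap(X, centers):
    # Label the data with the given centers, average per label,
    # and report the largest coordinate-wise difference to those centers.
    labels = assign_labels(X, centers)
    recomputed = centroids_from_labels(X, labels, centers)
    return np.abs(recomputed - centers).max()

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))   # stand-in for word vectors
centers = X[:3].copy()          # naive initialisation with the first 3 points

# Stop early: a single assignment + update iteration.
centers = centroids_from_labels(X, assign_labels(X, centers), centers)
gap_early = recompute_gap(X, centers)

# Keep iterating until the centers no longer change at all.
for _ in range(1000):
    updated = centroids_from_labels(X, assign_labels(X, centers), centers)
    if np.array_equal(updated, centers):
        break
    centers = updated
gap_converged = recompute_gap(X, centers)

print("gap after early stop: ", gap_early)
print("gap at a fixed point: ", gap_converged)
```

Under these assumptions, the per-label averages only coincide with the stored centers once the iteration has reached a fixed point. Before that, averaging by label amounts to one extra update step.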
> - I was later using the verbose option to see if the clustering has converged or not. I saw on the console log messages such as "center shift 7.994126e-04 within tolerance 1.243425e-06". It seems that this corresponds to some code in *kmeans_elkan.pyx* (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/_k_...).
>
> - Lastly, another thing that seems strange is that I hadn't set the tolerance value. So the default of 1e-4 should have been used. But if you look again at the above log, it says within tolerance 1.243425e-06 instead of 1e-4.

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/k_m...

The tolerance is scaled by the variance of the data to be independent of the scale.
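
As a rough sketch of what that scaling means, assume the effective tolerance is the mean per-feature variance of the data multiplied by `tol` (my reading of the linked code, so treat it as an assumption rather than a specification). The function name `effective_tolerance` and the random data below are hypothetical.

```python
import numpy as np

def effective_tolerance(X, tol=1e-4):
    # Assumed scaling: mean of the per-feature variances times the user-facing tol.
    return np.mean(np.var(X, axis=0)) * tol

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))   # stand-in for word vectors

# Shrinking the data by a factor of 10 shrinks every variance by a factor of 100,
# so the effective tolerance shrinks by the same factor of 100.
print(effective_tolerance(X))
print(effective_tolerance(0.1 * X))

# Working backwards from the log line in the question, still under the same assumption:
# the mean per-feature variance implied by tol=1e-4 and the reported tolerance.
implied_mean_variance = 1.243425e-06 / 1e-4
print(implied_mean_variance)
```

With the default `tol=1e-4`, a logged tolerance of 1.243425e-06 would then correspond to a mean per-feature variance of about 0.0124, which is plausible for word vectors with small coordinates.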
participants (2)
- Andreas Mueller
- Sidak Pal Singh