[SciPy-User] kmeans

Fri Jul 23 18:27:33 EDT 2010

On Jul 23, 2010, at 2:55 PM, Benjamin Root wrote:
> On Fri, Jul 23, 2010 at 4:18 PM, Lutz Maibaum <lutz.maibaum at gmail.com> wrote:
>> Actually, it is not entirely clear to me anymore what the bug is. According to the k-means Wikipedia page, the objective function that the algorithm tries to minimize is the total intra-cluster variance (the sum of squares of distances of data points from cluster centroids). However, the two steps of the iteration (assignment to centroids, and centroid update) use regular distances and means. Is this not what the current code is doing?
> 
> Which is why I have been saying that there is no bug here because the code is technically correct.  A mean of regular distances is a sum of squared distances that has been divided.  The only reason why the current code is not returning the correct answer for the given example is that it never tries 3 as a centroid value.  This is a different issue.
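
For reference, the two steps mentioned in the quoted text (assignment to the nearest centroid, then moving each centroid to the mean of its points) can be sketched in a few lines of plain Python. This is only an illustration for 1-D data, not the scipy implementation; the function name lloyd_1d, the max_iter parameter, and the handling of empty clusters are my own assumptions.

```python
def lloyd_1d(points, centroids, max_iter=100):
    """Two-step k-means iteration on 1-D data (illustrative sketch only)."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        # Plain and squared distances pick the same nearest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Update step: each centroid moves to the mean of its points.
        # Assumption: an empty cluster simply keeps its old centroid.
        updated = [sum(c) / len(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break  # nothing moved, so the iteration has converged
        centroids = updated
    return centroids


v = [1.0, 2.0, 3.0, 4.0, 10.0]
print(lloyd_1d(v, [3.0]))         # [4.0]
print(lloyd_1d(v, [1.0, 10.0]))   # [2.5, 10.0]
```

With a single cluster, the update step replaces any starting value, including 3, with the mean of all the data, so the sketch ends at 4 regardless of the initial guess.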

I apologize if I am being obtuse, but why do you think the current code does not return the correct answer?

>>> import numpy as np
>>> from scipy import cluster
>>> v = np.array([1,2,3,4,10],dtype=float)
>>> cluster.vq.kmeans(v, 1)
(array([ 4.]), 2.3999999999999999)
>>> np.sum([abs(x-4)**2 for x in v])
50.0
>>> np.sum([abs(x-3)**2 for x in v])
55.0

The centroid 4 (the mean of the data) gives a smaller sum of squared distances than 3 does (50 versus 55), and minimizing that sum is what kmeans is supposed to do.
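
My guess at where the value 3 comes from (an assumption on my part about the earlier messages) is that 3 is the median of the data, and the median minimizes the sum of plain, non-squared distances, while the mean minimizes the sum of squared ones. A small check in plain Python, scanning candidate centroids on a grid whose range and 0.5 step are arbitrary choices for illustration:

```python
from statistics import mean, median

v = [1.0, 2.0, 3.0, 4.0, 10.0]


def sum_sq(c):
    """Sum of squared distances from the points in v to candidate c."""
    return sum((x - c) ** 2 for x in v)


def sum_abs(c):
    """Sum of plain (non-squared) distances from the points in v to c."""
    return sum(abs(x - c) for x in v)


# Candidate centroids 0.0, 0.5, ..., 10.0 (illustrative grid, not exhaustive)
grid = [i / 2 for i in range(21)]
best_sq = min(grid, key=sum_sq)
best_abs = min(grid, key=sum_abs)

print(best_sq, mean(v))      # 4.0 4.0
print(best_abs, median(v))   # 3.0 3.0
print(sum_abs(4.0) / len(v), sum_abs(3.0) / len(v))   # 2.4 2.2
```

The last line shows that the second value returned by kmeans above, 2.4, equals the mean plain distance to the centroid 4. That quantity would indeed be smaller at 3 (2.2), but it is not the quantity that the centroid update minimizes.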

Best,

  Lutz