[SciPy-User] kmeans

Fri Jul 23 17:18:07 EDT 2010

On Jul 23, 2010, at 1:12 PM, Keith Goodman wrote:
> On Fri, Jul 23, 2010 at 1:01 PM, Benjamin Root <ben.root at ou.edu> wrote:
>> On Fri, Jul 23, 2010 at 2:53 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>> That's a good point. Are all these considered "bugs"?
>>> 
>>> - Switch code and doc to use rmse
>>> - Integer bug
>>> - Select initial centroids without replacement
>> 
>> My vote is yes, although I am not 100% convinced that the integer bug should
>> be changed because it may cause breakage with those who have been depending
>> on integer output.
> 
> Maybe just make a ticket for now for the integer problem? Lutz, do you
> want to make the ticket?

I have opened a ticket (#1246). An easy and safe fix would be to simply add a statement to the docstring that warns the user about clustering integer data.

> It would be nice to find a simple problem that gives the wrong
> centroids due to the sum of dist bug. We could use that for a unit
> test of the fix. The example given earlier in the thread returns the
> right centroid. I guess we need a ticket for this one too.

Actually, it not entirely clear to me anymore what the bug is. According to the k-means Wikipedia page, the objective function that the algorithm tries to minimize is the total intra-cluster variance (the sum of squares of distances of data points from cluster centroids). However, the two steps of the iteration (assignment to centroids, and centroid update) use regular distances and means. Is this not what the current code is doing?

In the Wikipedia description, the iteration proceeds until no more changes are made. The SciPy implementation has an additional convergence criterion. I would have thought that the change in the sum of squared distances would be a better choice than the change in the sum of distances, but since this convergence criterion is not part of any other implementation of kmeans this may incorrect.

One issue I see is that the documentation mentions that k-means tries to minimize distorition, defined as the sum of distances, which (at least according to the Wikipedia page) is not correct, because it tries to minimize the sum of squared distances.

  -- Lutz