[SciPy-User] kmeans

Keith Goodman kwgoodman at gmail.com
Fri Jul 23 16:12:06 EDT 2010


On Fri, Jul 23, 2010 at 1:01 PM, Benjamin Root <ben.root at ou.edu> wrote:
> On Fri, Jul 23, 2010 at 2:53 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>
>> On Fri, Jul 23, 2010 at 12:40 PM, Benjamin Root <ben.root at ou.edu> wrote:
>> > On Fri, Jul 23, 2010 at 2:06 PM, Lutz Maibaum <lutz.maibaum at gmail.com> wrote:
>> >>
>> >> On Fri, Jul 23, 2010 at 11:54 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
>> >> > On Fri, Jul 23, 2010 at 11:39 AM, Lutz Maibaum <lutz.maibaum at gmail.com> wrote:
>> >> >> To be compatible with the (at least to me!) standard use of k-means, I
>> >> >> think both code and doc should use the sum of squared distances as the
>> >> >> cost function in the optimization, and also as the return value.
>> >> >
>> >> > What about the thresh (threshold) input parameter? If the sum of
>> >> > squares were used then the user would have to adjust the threshold for
>> >> > the number of data points.
>> >>
>> >> That's true, but personally I don't find that much of a problem. Using
>> >> an absolute threshold one needs to have some intuition about the
>> >> magnitude of the cost function based on the type and amount of data.
>> >> Alternatively, one could use a relative improvement as the convergence
>> >> criterion, for example (something like "if
>> >> (old_cost-new_cost)/old_cost < threshold then converged"), which may
>> >> be suitable for a larger variety of clustering problems.
>> >>
>> >>  -- Lutz
>> >
>> > However, we wouldn't want to change the characteristic behavior of
>> > kmeans... yet.
>>
>> That's a good point. Are all these considered "bugs"?
>>
>> - Switch code and doc to use rmse
>> - Integer bug
>> - Select initial centroids without replacement
>
> My vote is yes, although I am not 100% convinced that the integer bug should
> be changed because it may cause breakage with those who have been depending
> on integer output.
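
On the thresh question in the quoted text, here is a minimal sketch of why
an absolute threshold on the sum of squares depends on the number of data
points, while an rmse-style cost and the relative criterion Lutz describes
do not. The data and the helper names (costs, converged_relative) are made
up for illustration; this is not how scipy implements kmeans.

```python
import numpy as np

def costs(obs, centroids):
    """Return (sum of squared distances, root-mean-square distance)
    from each observation to its nearest centroid."""
    # Pairwise squared distances, shape (n_obs, n_centroids).
    d2 = ((obs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.min(axis=1)
    return nearest.sum(), np.sqrt(nearest.mean())

def converged_relative(old_cost, new_cost, thresh):
    """Hypothetical helper: stop when the relative improvement is small."""
    return (old_cost - new_cost) / old_cost < thresh

# Made-up data: two well-separated pairs of points on a line.
obs = np.array([[0.0, 0.0], [1.0, 0.0], [9.0, 0.0], [10.0, 0.0]])
old = np.array([[0.0, 0.0], [9.0, 0.0]])   # centroids before an update
new = np.array([[0.5, 0.0], [9.5, 0.0]])   # centroids after an update

big = np.tile(obs, (100, 1))               # same data, 100 times the points

sse_small_old, rmse_small_old = costs(obs, old)
sse_small_new, rmse_small_new = costs(obs, new)
sse_big_old, rmse_big_old = costs(big, old)
sse_big_new, rmse_big_new = costs(big, new)

# The absolute improvement in the sum of squares grows with the data size ...
abs_small = sse_small_old - sse_small_new
abs_big = sse_big_old - sse_big_new

# ... while the rmse and the relative improvement are unchanged.
rel_small = (sse_small_old - sse_small_new) / sse_small_old
rel_big = (sse_big_old - sse_big_new) / sse_big_old
```

With a relative criterion the same thresh can be used for both data sets;
with an absolute threshold on the sum of squares it would have to be scaled
by the number of observations.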

Maybe just make a ticket for now for the integer problem? Lutz, do you
want to make the ticket?

It would be nice to find a simple problem that gives the wrong
centroids due to the sum of dist bug. We could use that for a unit
test of the fix. The example given earlier in the thread returns the
right centroid. I guess we need a ticket for this one too.
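
As a possible starting point for such a test, here is a minimal sketch
showing that the two cost functions can rank candidate centroids
differently, even for a single cluster. It is not a reproduction of the
kmeans problem itself. The data and the helper names (sum_dist,
sum_sq_dist) are made up for illustration, and it only assumes NumPy.

```python
import numpy as np

# Made-up 1-D data: two points at 0 and one outlier at 10.
obs = np.array([[0.0], [0.0], [10.0]])

def sum_dist(obs, centroid):
    """Sum of Euclidean distances from each observation to the centroid."""
    return np.sqrt(((obs - centroid) ** 2).sum(axis=1)).sum()

def sum_sq_dist(obs, centroid):
    """Sum of squared Euclidean distances to the centroid."""
    return ((obs - centroid) ** 2).sum()

mean_centroid = obs.mean(axis=0)     # what a k-means update step produces
other_centroid = np.array([0.0])     # a competing candidate (the median)

# The mean wins under squared distances ...
assert sum_sq_dist(obs, mean_centroid) < sum_sq_dist(obs, other_centroid)
# ... but loses under plain (unsquared) distances.
assert sum_dist(obs, mean_centroid) > sum_dist(obs, other_centroid)
```

A real unit test would still need a multi-cluster problem where kmeans
itself ends up with different centroids under the two costs; this only
shows that such a disagreement is possible.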


