[SciPy-User] kmeans

Fri Jul 23 13:48:47 EDT 2010

On Fri, Jul 23, 2010 at 1:36 PM, Benjamin Root <ben.root at ou.edu> wrote:

> On Fri, Jul 23, 2010 at 12:27 PM, David Cournapeau <cournape at gmail.com> wrote:
>
>> On Sat, Jul 24, 2010 at 2:19 AM, Benjamin Root <ben.root at ou.edu> wrote:
>>
>> >
>> > Examining further, I see that SciPy's implementation is fairly simplistic
>> > and has some issues.  In the given example, the reason why 3 is never
>> > returned is not because of the use of the distortion metric, but rather
>> > because the kmeans function never sees the distance for using 3.  As a
>> > matter of fact, the actual code that does the convergence is in vq and
>> > py_vq (vector quantization) and it tries to minimize the sum of squared
>> > errors.  kmeans just keeps on retrying the convergence with random guesses
>> > to see if different convergences occur.
>>
>> As one of the maintainers of kmeans, I would be the first to admit the
>> code is basic, for good and bad. Something more elaborate for
>> clustering may indeed be useful, as long as the interface stays
>> simple.
>>
>> More complex needs should turn to scikits.learn or more specialized
>> packages.
>>
>> cheers,
>>
>> David
>>
>
> I agree, kmeans does not need to get very complicated because kmeans (the
> general concept) is not very suitable for very complicated situations.
>
> As a thought, a possible way to help out the current implementation is to
> ensure that unique guesses are made.  Currently, several iterations are
> wasted by performing guesses that it has already done before.  Is there a
> way to do sampling without replacement in numpy.random?
>
> Ben Root
>
>
Here is an old thread about initializing kmeans with and without replacement:
http://old.nabble.com/kmeans-and-initial-centroid-guesses-td26938926.html
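
On the question of sampling without replacement: one way to do it is to
shuffle the row indices and keep the first k.  The following is only a
sketch, not what scipy does internally.  The function name
pick_initial_centroids and its arguments are made up for illustration,
and np.random.default_rng assumes a reasonably recent NumPy (the older
np.random.permutation can be called the same way).

```python
import numpy as np

def pick_initial_centroids(obs, k, rng=None):
    """Return k rows of obs chosen without replacement (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    obs = np.asarray(obs)
    n = obs.shape[0]
    if k > n:
        raise ValueError("k cannot exceed the number of observations")
    # Shuffle the row indices and keep the first k: no index can repeat.
    # Note: if obs itself contains duplicate rows, the chosen centroids
    # can still have equal values even though their indices differ.
    idx = rng.permutation(n)[:k]
    return obs[idx]

obs = np.arange(20, dtype=float).reshape(10, 2)
guess = pick_initial_centroids(obs, 4, rng=np.random.default_rng(0))
```

This only makes the k centroids within a single guess distinct.  Avoiding
a repeat of an entire earlier guess would additionally require remembering
which index sets have already been tried.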

If scipy wants to use the most vanilla kmeans, then I suggest that it use
the sum of squared errors everywhere it currently uses the sum of errors.
If you really want to optimize the sum of errors, then the median is
probably a better cluster center than the mean, but supporting more center
definitions would start to get more complicated.
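
To illustrate the mean versus median point, here is a small 1-D check.
The data and the helper names sum_sq_err and sum_abs_err are made up for
the example.  In one dimension the mean gives the smaller sum of squared
errors and the median gives the smaller sum of absolute errors; in more
than one dimension the minimizer of summed Euclidean distances is the
geometric median, which this sketch does not cover.

```python
import numpy as np

def sum_sq_err(points, center):
    """Sum of squared distances from each point to the center."""
    return float(np.sum((points - center) ** 2))

def sum_abs_err(points, center):
    """Sum of (unsquared) distances from each point to the center."""
    return float(np.sum(np.abs(points - center)))

# One cluster with a single outlier, so mean and median differ a lot.
points = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
mean_c = float(points.mean())
median_c = float(np.median(points))

# The mean wins on squared error, the median wins on absolute error.
print(sum_sq_err(points, mean_c), sum_sq_err(points, median_c))
print(sum_abs_err(points, mean_c), sum_abs_err(points, median_c))
```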

Alex