[SciPy-User] kmeans and initial centroid guesses
Keith Goodman
kwgoodman at gmail.com
Sun Dec 27 20:37:21 EST 2009
The kmeans function has two modes. In one of the modes the initial
guesses for the centroids are randomly selected from the input data.
The selection is currently done with replacement:
guess = take(obs, randint(0, No, k), 0)
That means some of the centroids in the initial guess might be the
same. Wouldn't it be better to select without replacement? Something
like:
guess = take(obs, rand(No).argsort()[:k], 0)
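To make the difference concrete, here is a small self-contained sketch of the two selection schemes. The helper names and the seeded RandomState are my own choices for illustration and are not part of scipy.cluster.vq; only the take/randint/rand/argsort calls come from the lines above.

```python
import numpy as np

def guess_with_replacement(obs, k, rng):
    # Same idea as the current code: k independent draws, so row
    # indices (and therefore centroids) can repeat.
    idx = rng.randint(0, obs.shape[0], k)
    return np.take(obs, idx, axis=0)

def guess_without_replacement(obs, k, rng):
    # Sorting random keys gives a random permutation of the row
    # indices; the first k entries are guaranteed to be distinct.
    idx = rng.rand(obs.shape[0]).argsort()[:k]
    return np.take(obs, idx, axis=0)

obs = np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]])
rng = np.random.RandomState(0)  # seed chosen only for repeatability

guess = guess_without_replacement(obs, 4, rng)
# Count distinct rows in the guess.
n_distinct = len({tuple(row) for row in guess})
```

With k equal to the number of observations, the second helper always returns every row exactly once, while the first one will usually return duplicates. Note that distinct row indices only give distinct centroids if obs itself has no repeated rows.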
Here's an extreme example of what can go wrong if the selection is
done with replacement:
>> obs
array([[ 1,  1],
       [-1, -1],
       [-1,  1],
       [ 1, -1]])
>> vq.kmeans(obs, k_or_guess=4)
(array([[-1, -1],
       [-1,  1],
       [ 1, -1],
       [ 1,  1]]), 0.0)  # <--- good
>>
>> k_or_guess = obs[[1,1,1,1],:]
>> k_or_guess
array([[-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1]])
>> vq.kmeans(obs, k_or_guess)
(array([[0, 0]]), 1.4142135623730951)  # <--- not as good
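My reading of why the second call collapses to a single centroid, sketched in plain NumPy: with four identical guesses every observation ties and lands on the first centroid, the other three get no members, and presumably kmeans drops them. This is a hand-rolled single update step written for illustration, not the actual vq.kmeans code, and the variable names are mine.

```python
import numpy as np

obs = np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]], dtype=float)
guess = obs[[1, 1, 1, 1], :]  # four identical centroids

# Distance from every observation to every centroid, shape (4, 4).
dist = np.linalg.norm(obs[:, None, :] - guess[None, :, :], axis=2)

# argmin breaks ties by taking the first index, so every
# observation is assigned to centroid 0.
labels = dist.argmin(axis=1)

# Keep only the centroids that received observations (assumed to
# match what kmeans does with empty clusters) and recompute them.
used, compact = np.unique(labels, return_inverse=True)
centroids = np.array([obs[labels == j].mean(axis=0) for j in used])

# Mean distance from each observation to its assigned centroid.
distortion = np.linalg.norm(obs - centroids[compact], axis=1).mean()
```

The one surviving centroid is the mean of all four points, the origin, and each point sits sqrt(2) away from it, which matches the distortion printed above.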
In most cases it won't make any difference. But the cost of the code
change is small.