[SciPy-User] kmeans and initial centroid guesses

Sun Dec 27 20:37:21 EST 2009

The kmeans function has two modes. In one of the modes the initial
guesses for the centroids are randomly selected from the input data.
The selection is currently done with replacement:

guess = take(obs, randint(0, No, k), 0)

That means some of the centroids in the initial guess might be the
same. Wouldn't it be better to select without replacement? Something
like

guess = take(obs, rand(No).argsort()[:k], 0)
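
For anyone who wants to try the two strategies side by side, here is a
minimal standalone sketch. It is not the scipy source: the function names
and the use of numpy's default_rng generator are my own assumptions, and it
assumes k is no larger than the number of observations.

```python
import numpy as np

def guess_with_replacement(obs, k, rng):
    # Hypothetical helper: draws k row indices independently,
    # so the same row can be picked more than once.
    idx = rng.integers(0, obs.shape[0], size=k)
    return np.take(obs, idx, axis=0)

def guess_without_replacement(obs, k, rng):
    # Hypothetical helper: shuffles all row indices and keeps the first k,
    # so no row index is repeated (requires k <= number of rows).
    idx = rng.permutation(obs.shape[0])[:k]
    return np.take(obs, idx, axis=0)

obs = np.array([[1, 1], [-1, -1], [-1, 1], [1, -1]])
rng = np.random.default_rng(0)
guess = guess_without_replacement(obs, 4, rng)
```

Without replacement, the guess contains k distinct rows of obs whenever the
rows of obs are themselves distinct. If obs has repeated rows, duplicate
centroids are still possible.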

Here's an extreme example of what can go wrong if the selection is
done with replacement:

>> obs

array([[ 1,  1],
       [-1, -1],
       [-1,  1],
       [ 1, -1]])
>> vq.kmeans(obs, k_or_guess=4)

(array([[-1, -1],
       [-1,  1],
       [ 1, -1],
       [ 1,  1]]), 0.0) # <--- good
>>
>> k_or_guess = obs[[1,1,1,1],:]
>> k_or_guess

array([[-1, -1],
       [-1, -1],
       [-1, -1],
       [-1, -1]])
>> vq.kmeans(obs, k_or_guess)

(array([[0, 0]]), 1.4142135623730951) # <--- not as good

In most cases it won't make any difference. But the cost of the code
change is small.
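
To put a rough number on "most cases", the chance that k draws with
replacement from No observations are all distinct is the product of
(No - i) / No for i = 0 .. k-1. A small sketch of that arithmetic follows;
the function name is just an illustration, and it assumes each draw is
uniform and independent.

```python
def prob_all_distinct(n_obs, k):
    # Probability that k uniform draws with replacement from n_obs items
    # hit k different items (the birthday-problem product).
    p = 1.0
    for i in range(k):
        p *= (n_obs - i) / n_obs
    return p

p_small = prob_all_distinct(4, 4)      # 4 observations, k = 4
p_large = prob_all_distinct(1000, 4)   # 1000 observations, k = 4
```

For the 4-point example above, p_small is about 0.094, so roughly 91% of
with-replacement guesses contain at least one duplicate. With 1000
observations and k = 4, p_large is about 0.994, so duplicates show up in
well under 1% of guesses.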