[SciPy-User] kmeans (Re: Mailing list?)

David Warde-Farley dwf at cs.toronto.edu
Mon Nov 23 01:10:03 EST 2009


On 22-Nov-09, at 1:46 PM, Simon Friedberger wrote:

> I just looked at the documentation for the K-Means vector quantization
> functions and I am a bit confused. On the one hand it says that
> normalization to unit variance would be beneficial on the other hand
> there are a lot of "must"s in the descriptions.
> I was wondering if it is possible to use the functions without
> normalization or if there is a negative impact.

Damian did use some strong language there. The kmeans function won't
know whether you've normalized or not, but in some cases you can
expect much better solutions with normalized input. (The function is
called "whiten", which is somewhat misleading: in the literature
"whitening" usually means decorrelating the features, i.e. rotating
onto the eigenvectors of the covariance, whereas here each feature is
simply rescaled to unit variance.)
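
For concreteness, a minimal sketch of the intended usage (the data
and the number of clusters here are made up):

import numpy as np
from scipy.cluster.vq import whiten, kmeans

# made-up data: 200 observations, one feature in the thousands and
# one that is always less than 1
obs = np.column_stack([np.random.randn(200) * 1000.0,
                       np.random.rand(200)])

w = whiten(obs)                      # divide each column by its std. dev.
codebook, distortion = kmeans(w, 3)  # cluster the whitened data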

kmeans uses Euclidean distance, i.e. the distance between two points
is the square root of the sum of squared differences between their
corresponding coordinates. If your coordinates have vastly different
scales, say some in the thousands and some always less than one, then
one or two coordinates can dominate the distance calculation and make
the other coordinates nearly irrelevant to the clustering (and if the
difference in scale is _really_ big, you can lose precision to
rounding error, too).
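
A tiny made-up example of that effect:

import numpy as np

a = np.array([5000.0, 0.2])  # one coordinate in the thousands, one below 1
b = np.array([5010.0, 0.9])

d = np.sqrt(np.sum((a - b) ** 2))  # ~10.02
# the difference of 0.7 in the second coordinate is almost invisible
# next to the difference of 10.0 in the first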

Units are almost always arbitrary, so scaling by the standard
deviation helps by treating all of your features "equally". You may
also want to subtract the mean before calling whiten(); that is one
of the standard preprocessing tricks. It doesn't matter so much here,
but it can help with numerical conditioning in a lot of algorithms,
particularly ones that involve gradient descent.
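
Since whiten() only rescales, the centering has to be done by hand.
A sketch, reusing the made-up obs array from above:

mean = obs.mean(axis=0)
std = obs.std(axis=0)
normed = (obs - mean) / std  # zero mean, unit variance per column
# equivalent to whiten(obs - mean)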

If you want to quantize new vectors after clustering, you can simply
apply the inverse transformation to your codebook/centroids (i.e.
multiply by the standard deviation of the original data and add back
the mean, if you subtracted it), which puts them at the correct
position in the space of the original data.
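
Continuing the sketch above (mean, std and normed as before), that
inverse step is just:

from scipy.cluster.vq import kmeans

codebook, distortion = kmeans(normed, 3)
# undo the normalization so the centroids sit in the original units
codebook_orig = codebook * std + mean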

> This would make sense because it seems reasonable that one would  
> want to
> build the codebook on some set and then quantize a different set. In
> this case normalization would have to be the same or be omitted.

Yes, you could apply the normalization you applied to the training
data to every point you wish to quantize after the fact, but it's
usually easier to just apply the inverse transformation to the
codebook (especially if the number of points you want to quantize
greatly exceeds the number of entries in your codebook, in which case
you save a lot of floating point operations).
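
In code, reusing codebook, codebook_orig, mean and std from the
sketches above (new_points is made-up stand-in data for whatever you
want to quantize):

import numpy as np
from scipy.cluster.vq import vq

new_points = np.column_stack([np.random.randn(50) * 1000.0,
                              np.random.rand(50)])

labels, _ = vq(new_points, codebook_orig)  # one-off rescaled codebook
# versus normalizing every new point instead:
# labels, _ = vq((new_points - mean) / std, codebook)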

> I am also interested in literature recommendations concerning why this
> is a good idea in general.

Well, lots of references will tell you what I just told you (that high- 
variance features will dominate distance calculations if you don't  
normalize). There is some discussion of it in the 'Prototype Methods'  
chapter in 'The Elements of Statistical Learning': http://www-stat.stanford.edu/~tibs/ElemStatLearn/

David



