[SciPy-User] kmeans (Re: Mailing list?)
David Warde-Farley
dwf at cs.toronto.edu
Mon Nov 23 01:10:03 EST 2009
On 22-Nov-09, at 1:46 PM, Simon Friedberger wrote:
> I just looked at the documentation for the K-Means vector quantization
> functions and I am a bit confused. On the one hand it says that
> normalization to unit variance would be beneficial on the other hand
> there are a lot of "must"s in the descriptions.
> I was wondering if it is possible to use the functions without
> normalization or if there is a negative impact.
Damian did use some strong language there. The kmeans function won't
know whether you've normalized or not, but in some cases you can
expect much better solutions with normalized input (the function is
called "whiten" which is somewhat misleading, as "whitening" is often
used in the literature to mean decorrelating i.e. rotating by the
eigenvectors of the covariance).
kmeans uses the Euclidean distance, meaning that the distance between
two points is the sum of the squared difference of each point's
coordinates. If you have different coordinates that have vastly
different scales, say some in the thousands and some that are always
less than one, then one or two coordinates can dominate the distance
calculation and make the other coordinates nearly irrelevant in the
clustering (if the difference in scale is _really_ big then you can
lose precision in the rounding error, too).
Units are almost always arbitrary, and so scaling by the standard
deviation helps this in that it treats all of your features
"equally" (you may also want to subtract the mean before calling
whiten(), as well - this is in fact one of the standard tricks, it
doesn't matter so much here but it can help with numerical
conditioning in a lot of algorithms, particularly ones that involve
gradient descent).
If you want to quantize new vectors after clustering then you can
simply apply the reverse transformation to your codebook/centroids
(i.e. multiply by the std. dev. of the original data and add back the
mean if you subtracted it) which will scale them to the correct
position in the space of the original data.
> This would make sense because it seems reasonable that one would
> want to
> build the codebook on some set and then quantize a different set. In
> this case normalization would have to be the same or be omitted.
Yes, you could apply the normalization you applied to the training
data on every point you wish to quantize after the fact, but it's
usually easier to just apply the inverse transformation to the
codebook (especially if the number of points you want to quantize
greatly exceeds the number of items in your codebook, in which case
you save a lot of floating point ops).
> I am also interested in literature recommendations concerning why this
> is a good idea in general.
Well, lots of references will tell you what I just told you (that high-
variance features will dominate distance calculations if you don't
normalize). There is some discussion of it in the 'Prototype Methods'
chapter in 'The Elements of Statistical Learning': http://www-stat.stanford.edu/~tibs/ElemStatLearn/
David
More information about the SciPy-User
mailing list