[SciPy-User] kmeans

Benjamin Root ben.root at ou.edu
Thu Jul 22 12:24:09 EDT 2010


On Thu, Jul 22, 2010 at 10:42 AM, alex <argriffi at ncsu.edu> wrote:

> On Thu, Jul 22, 2010 at 10:48 AM, Benjamin Root <ben.root at ou.edu> wrote:
>
>> On Wed, Jul 21, 2010 at 3:25 PM, alex <argriffi at ncsu.edu> wrote:
>>
>>> Hi,
>>>
>>> I want to nitpick about the scipy kmeans clustering implementation.
>>> Throughout the documentation
>>> http://docs.scipy.org/doc/scipy/reference/cluster.vq.html and code, the
>>> 'distortion' of a clustering is defined as "the sum of the distances between
>>> each observation vector and its dominating centroid."  I think that the sum
>>> of squares of distances should be used instead of the sum of distances, and
>>> all of the miscellaneous kmeans descriptions I found with google would seem
>>> to support this.
>>>
>>> For example if one cluster contains the 1D points (1, 2, 3, 4, 10) and
>>> the old center is 3, then the centroid updating step will move the centroid
>>> to 4.  This step reduces the sum of squares of distances from 55 to 50, but
>>> it increases the distortion from 11 to 12.
>>>
>>> Alex
>>>
>>
>> Every implementation of kmeans (except for SciPy's) that I have seen
>> allowed for the user to specify which distance measure they want to use.
>> There is no right answer for a distance measure except for "it depends".
>> Maybe SciPy's implementation should be updated to allow for user-specified
>> distance measures (e.g. - absolute, euclidian, city-block, etc.)?
>>
>> Ben Root
>>
>
> While the best distance might depend on the application, I think that of
> all these distance measures only the sum of squares of Euclidean distances
> is guaranteed to monotonically decrease at each step of the algorithm.  If
> the scipy kmeans implementation depends on this monotonicity, which I think
> it currently does, then this assumption could be the source of subtle bugs
> when error measures like distortion are used.
>
> Maybe I can nitpick more effectively if I frame it as "scipy is doing
> something strange and possibly buggy" instead of as "scipy is not using the
> distance function I want".
>
>
It has been a while since I played with kmeans, but I believe that the
distance measure merely has to satisfy Minkowski's inequality which is that
the norm of a + b <= norm of a + norm of b (or was it the Cauchy-Schwarz
inequality?)



> For example:
>
> >>> import numpy as np
> >>> from scipy import cluster
> >>> v = np.array([1,2,3,4,10])
> >>> cluster.vq.kmeans(v, 1)
> (array([4]), 2.3999999999999999)
> >>> np.mean([abs(x-4) for x in v])
> 2.3999999999999999
> >>> np.mean([abs(x-3) for x in v])
> 2.2000000000000002
>
> The result of this kmeans call suggests that the center 4 is best with
> distortion 2.4.  In fact this is not the case because a center of 3 would
> have distortion 2.2.
>
>
I wonder if this is really a bug in the minimization code rather than an
issue with the distortion measure itself.

Ben Root
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.scipy.org/pipermail/scipy-user/attachments/20100722/12d3536c/attachment.html>


More information about the SciPy-User mailing list