[SciPy-User] kmeans

Sun Jul 25 18:24:57 EDT 2010

On Sun, Jul 25, 2010 at 3:17 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
> On Sun, Jul 25, 2010 at 2:59 PM, David Cournapeau <cournape at gmail.com> wrote:
>> On Mon, Jul 26, 2010 at 6:53 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>> On Sun, Jul 25, 2010 at 12:41 PM, David Cournapeau <cournape at gmail.com> wrote:
>>>> On Sun, Jul 25, 2010 at 2:36 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>>> _kmeans chokes on large thresholds:
>>>>>
>>>>>>> from scipy import cluster
>>>>>>> v = np.array([1,2,3,4,10], dtype=float)
>>>>>>> cluster.vq.kmeans(v, 1, thresh=1e15)
>>>>>   (array([ 4.]), 2.3999999999999999)
>>>>>>> cluster.vq.kmeans(v, 1, thresh=1e16)
>>>>> <snip>
>>>>> IndexError: list index out of range
>>>>>
>>>>> The problem is in these lines:
>>>>>
>>>>>    diff = thresh+1.
>>>>>    while diff > thresh:
>>>>>        <snip>
>>>>>        if(diff > thresh):
>>>>>
>>>>> If thresh is large then (thresh + 1) > thresh is False:
>>>>>
>>>>>>> thresh = 1e16
>>>>>>> diff = thresh + 1.0
>>>>>>> diff > thresh
>>>>>   False
>>>>>
>>>>> What's a use case for a large threshold? You might want to study the
>>>>> algorithm by seeing the result after one iteration (not to be confused
>>>>> with the iter input which is something else).
>>>>>
>>>>> One fix is to use 2*thresh instead for thresh + 1. But that just
>>>>> pushes the problem out to higher thresholds
>>>>
>>>> Or just use the spacing function, which by definition returns the
>>>> smallest number M such as thresh + M > thresh (except for nan/inf)
>>>
>>> Neat, I've never heard of np.spacing. But it suffers the same fate:
>>>
>>> Works:
>>>
>>>>> thresh = 1e16
>>>>> diff = thresh + np.spacing(thresh)
>>>>> diff > thresh
>>>   True
>>>
>>> Doesn't work:
>>>
>>>>> thresh = 1e400
>>>>> diff = thresh + np.spacing(thresh)
>>>>> diff > thresh
>>>   False
>>
>> That's because 1e400 is inf for double precision numbers, and inf + N
>>> inf is never true :)
>
> That makes sense. But it is also the reason not to use np.spacing for
> kmeans. Entering thresh=np.inf seems reasonable if you want to make
> sure only one iteration is performed. Using
>
> if (diff > thesh) or (len(dist_arg) == 0)
>
> should fix it. Is the extra time OK for such a small corner case? I think so.

Oh, but I'm getting lost in a thicket of small issues. The big issue
is that the stopping condition is wrong. kmeans currently stops
iterating when changes in the mean sum of distances is below the
threshold. But the mean distance doesn't monotonically decrease. So
should we switch from changes in mean distance to changes in the root
of the mean squared distance? And then update the doc? Or have we
already decided on that? Unless I'm missing something (and I'm new to
kmeans) I don't think we have a choice.

David, what's your take?