[SciPy-User] kmeans

Keith Goodman kwgoodman at gmail.com
Sun Jul 25 18:17:55 EDT 2010


On Sun, Jul 25, 2010 at 2:59 PM, David Cournapeau <cournape at gmail.com> wrote:
> On Mon, Jul 26, 2010 at 6:53 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
>> On Sun, Jul 25, 2010 at 12:41 PM, David Cournapeau <cournape at gmail.com> wrote:
>>> On Sun, Jul 25, 2010 at 2:36 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>> _kmeans chokes on large thresholds:
>>>>
>>>>>> from scipy import cluster
>>>>>> v = np.array([1,2,3,4,10], dtype=float)
>>>>>> cluster.vq.kmeans(v, 1, thresh=1e15)
>>>>   (array([ 4.]), 2.3999999999999999)
>>>>>> cluster.vq.kmeans(v, 1, thresh=1e16)
>>>> <snip>
>>>> IndexError: list index out of range
>>>>
>>>> The problem is in these lines:
>>>>
>>>>    diff = thresh+1.
>>>>    while diff > thresh:
>>>>        <snip>
>>>>        if(diff > thresh):
>>>>
>>>> If thresh is large then (thresh + 1) > thresh is False:
>>>>
>>>>>> thresh = 1e16
>>>>>> diff = thresh + 1.0
>>>>>> diff > thresh
>>>>   False
>>>>
>>>> What's a use case for a large threshold? You might want to study the
>>>> algorithm by seeing the result after one iteration (not to be confused
>>>> with the iter input which is something else).
>>>>
>>>> One fix is to use 2*thresh instead for thresh + 1. But that just
>>>> pushes the problem out to higher thresholds
>>>
>>> Or just use the spacing function, which by definition returns the
>>> smallest number M such as thresh + M > thresh (except for nan/inf)
>>
>> Neat, I've never heard of np.spacing. But it suffers the same fate:
>>
>> Works:
>>
>>>> thresh = 1e16
>>>> diff = thresh + np.spacing(thresh)
>>>> diff > thresh
>>   True
>>
>> Doesn't work:
>>
>>>> thresh = 1e400
>>>> diff = thresh + np.spacing(thresh)
>>>> diff > thresh
>>   False
>
> That's because 1e400 is inf for double precision numbers, and inf + N
>> inf is never true :)

That makes sense. But it is also the reason not to use np.spacing for
kmeans. Entering thresh=np.inf seems reasonable if you want to make
sure only one iteration is performed. Using

if (diff > thesh) or (len(dist_arg) == 0)

should fix it. Is the extra time OK for such a small corner case? I think so.



More information about the SciPy-User mailing list