[SciPy-User] kmeans
Keith Goodman
kwgoodman at gmail.com
Sun Jul 25 18:17:55 EDT 2010
On Sun, Jul 25, 2010 at 2:59 PM, David Cournapeau <cournape at gmail.com> wrote:
> On Mon, Jul 26, 2010 at 6:53 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
>> On Sun, Jul 25, 2010 at 12:41 PM, David Cournapeau <cournape at gmail.com> wrote:
>>> On Sun, Jul 25, 2010 at 2:36 AM, Keith Goodman <kwgoodman at gmail.com> wrote:
>>>> _kmeans chokes on large thresholds:
>>>>
>>>>>> import numpy as np
>>>>>> from scipy import cluster
>>>>>> v = np.array([1,2,3,4,10], dtype=float)
>>>>>> cluster.vq.kmeans(v, 1, thresh=1e15)
>>>> (array([ 4.]), 2.3999999999999999)
>>>>>> cluster.vq.kmeans(v, 1, thresh=1e16)
>>>> <snip>
>>>> IndexError: list index out of range
>>>>
>>>> The problem is in these lines:
>>>>
>>>> diff = thresh+1.
>>>> while diff > thresh:
>>>> <snip>
>>>> if(diff > thresh):
>>>>
>>>> If thresh is large then (thresh + 1) > thresh is False:
>>>>
>>>>>> thresh = 1e16
>>>>>> diff = thresh + 1.0
>>>>>> diff > thresh
>>>> False
>>>>
>>>> What's a use case for a large threshold? You might want to study the
>>>> algorithm by seeing the result after one iteration (not to be confused
>>>> with the iter input which is something else).
>>>>
>>>> One fix is to use 2*thresh instead of thresh + 1. But that just
>>>> pushes the problem out to higher thresholds.
>>>
>>> Or just use the spacing function, which by definition returns the
>>> smallest number M such that thresh + M > thresh (except for nan/inf).
>>
>> Neat, I've never heard of np.spacing. But it suffers the same fate:
>>
>> Works:
>>
>>>> thresh = 1e16
>>>> diff = thresh + np.spacing(thresh)
>>>> diff > thresh
>> True
>>
>> Doesn't work:
>>
>>>> thresh = 1e400
>>>> diff = thresh + np.spacing(thresh)
>>>> diff > thresh
>> False
>
> That's because 1e400 is inf for double precision numbers, and
> inf + N > inf is never true :)
That makes sense. But it is also the reason not to use np.spacing for
kmeans. Entering thresh=np.inf seems reasonable if you want to make
sure only one iteration is performed. Using
if (diff > thresh) or (len(dist_arg) == 0)
should fix it. Is the extra time OK for such a small corner case? I think so.
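
Here is a minimal sketch of the idea, assuming a toy one-dimensional k-means loop rather than SciPy's actual _kmeans. The function toy_kmeans_1d and its variable names are hypothetical and exist only for illustration. The first few lines repeat the floating-point comparisons from this thread; the loop then shows how an explicit first-pass check lets thresh=np.inf perform exactly one iteration.

```python
import numpy as np

# Comparisons discussed in the thread (double precision).
thresh = 1e16
print(thresh + 1.0 > thresh)                 # False: 1.0 is below the spacing at 1e16
print(thresh + np.spacing(thresh) > thresh)  # True
print(np.inf + 1.0 > np.inf)                 # False: nothing is greater than inf


def toy_kmeans_1d(obs, guess, thresh=1e-5):
    """Hypothetical 1-D k-means loop; NOT SciPy's implementation."""
    obs = np.asarray(obs, dtype=float)
    code_book = np.array(guess, dtype=float)
    avg_dist = []  # mean distortion recorded at each iteration
    diff = np.inf
    # The length check forces a first pass for any thresh, including inf.
    while diff > thresh or len(avg_dist) == 0:
        # Distance from every observation to every centroid.
        dist = np.abs(obs[:, None] - code_book[None, :])
        labels = dist.argmin(axis=1)
        avg_dist.append(dist.min(axis=1).mean())
        # Move each centroid to the mean of its members; leave empty ones alone.
        for k in range(code_book.size):
            members = obs[labels == k]
            if members.size:
                code_book[k] = members.mean()
        # Improvement in distortion, once two iterations are available.
        if len(avg_dist) > 1:
            diff = avg_dist[-2] - avg_dist[-1]
    return code_book, avg_dist[-1]


v = np.array([1, 2, 3, 4, 10], dtype=float)
# With thresh=np.inf the loop body runs once and then stops.
one_pass = toy_kmeans_1d(v, [1.0], thresh=np.inf)
# With the default thresh it runs until the distortion stops improving.
converged = toy_kmeans_1d(v, [1.0])
```

In this sketch the returned distortion is measured against the centroids used during the last pass, so the one-iteration result reports the distortion of the initial guess alongside the updated centroid.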