[SciPy-User] kmeans

Benjamin Root ben.root at ou.edu
Fri Jul 23 22:24:22 EDT 2010


On Fri, Jul 23, 2010 at 8:56 PM, Benjamin Root <ben.root at ou.edu> wrote:

> On Fri, Jul 23, 2010 at 7:53 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>
>> On Fri, Jul 23, 2010 at 5:46 PM, Benjamin Root <ben.root at ou.edu> wrote:
>> > On Fri, Jul 23, 2010 at 6:48 PM, Keith Goodman <kwgoodman at gmail.com> wrote:
>> >>
>> >> On Fri, Jul 23, 2010 at 4:00 PM, Benjamin Root <ben.root at ou.edu> wrote:
>> >>
>> >> > The stopping condition uses the change in the distortion, not a
>> >> > non-squared distance.  The distortion is already a sum of squares.
>> >> > The only place that a non-squared distance is used is in
>> >> > _py_vq_1d() which appears to be very old code and it has a raise
>> >> > error at the very first statement.
>> >>
>> >> That's good news.
>> >>
>> >> Another place that a non-squared distance is used is the return value:
>> >>
>> >> >> import numpy as np
>> >> >> from scipy import cluster
>> >> >> v = np.array([1,2,3,4,10],dtype=float)
>> >> >> cluster.vq.kmeans(v, 1)
>> >>   (array([ 4.]), 2.3999999999999999)
>> >>
>> >> >> np.sqrt(np.dot(v-4, v-4) / 5.0)
>> >>   3.1622776601683795  # Nope, not returned
>> >> >> np.absolute(v - 4).mean()
>> >>   2.3999999999999999 # Yep, this one is returned
>> >>
>> >> Is that a code bug or a doc bug?
>> >
>> > Well, see, that's just the thing... the doc says that it returns the
>> > distortion, which is what it does, but obviously, this distortion was an
>> > MAE and not an RMSE.  The problem is that I have gone backwards and
>> > forwards over the code, including the Cython version, and I can't find
>> > anyplace where this is happening.
>> >
>> > Does anybody know of any good code tracing tools?  I used trace once,
>> > but it wasn't very user-friendly...
>>
>> I think I see it! Yes, the squared distance is calculated. But before
>> it is summed or meaned, the square root is taken. That turns the
>> squared distance into just distance.
>>
>
> Are you talking about the sqrt in py_vq()?  That doesn't get called in the
> given example... however, you are right that the distances being returned
> are square-rooted before the return.  It is happening in the C code,
> though, and I just don't know where...
>
>
Actually, I think I see it now.  In src/vq.c, the function double_vq_obs()
finds which centroid an observation should be assigned to.  It calculates a
Euclidean distance, as it should, and returns the smallest distance together
with the centroid that matched best.  This info is passed to double_tvq(),
which does this for each observation.  double_tvq() is called by
compute_vq() in src/vq_module.c, which is the function called by _vq.vq()
in vq.py...
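
As a rough illustration of what that C path computes, here is a NumPy
sketch.  This is not the actual SciPy source: nearest_centroid() and
assign_all() are hypothetical stand-ins for double_vq_obs() and
double_tvq(), written only to show where the square root gets applied.

```python
import numpy as np

def nearest_centroid(obs, code_book):
    """Hypothetical stand-in for double_vq_obs(): handles one observation."""
    diff = code_book - obs                 # (k, d) differences to each centroid
    sq_dist = np.sum(diff * diff, axis=1)  # squared Euclidean distances
    best = int(np.argmin(sq_dist))
    # The square root is taken here, so callers receive a plain distance.
    return best, float(np.sqrt(sq_dist[best]))

def assign_all(obs, code_book):
    """Hypothetical stand-in for double_tvq(): loops over all observations."""
    codes = np.empty(len(obs), dtype=int)
    dists = np.empty(len(obs), dtype=float)
    for i, row in enumerate(obs):
        codes[i], dists[i] = nearest_centroid(row, code_book)
    return codes, dists

# Same data as the example quoted above, as a column of 1-d observations.
v = np.array([1, 2, 3, 4, 10], dtype=float).reshape(-1, 1)
codes, dists = assign_all(v, np.array([[4.0]]))
print(dists)         # plain (non-squared) distances to the centroid
print(dists.mean())  # what a mean() over this array reports
```

With a single centroid at 4, the mean of these distances is the 2.4 seen
in the quoted session, not the RMSE.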

That array of distances is what gets passed into the mean() call in
_kmeans(), so the reported distortion is a mean of non-squared distances.
Therefore, either we need to square the returned values before averaging,
or remove all of the square roots elsewhere (making sure we take a square
root when we are done, of course...).
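
To make the two options concrete, here is a small NumPy check on the
example quoted earlier in this thread.  It is only a sketch: dist stands in
for the array of already square-rooted distances coming back from the C
code, and the variable names are mine, not anything in vq.py.

```python
import numpy as np

v = np.array([1, 2, 3, 4, 10], dtype=float)
centroid = 4.0

# Stand-in for the distances returned by the C code (already square-rooted).
dist = np.abs(v - centroid)

# Behaviour observed in this thread: mean of plain distances (an MAE).
mean_dist = dist.mean()

# Option 1: square the returned distances, average, then take the root.
rmse_from_dist = np.sqrt(np.mean(dist ** 2))

# Option 2: skip the per-observation root, average the squared distances,
# and take a single square root at the end.
sq_dist = (v - centroid) ** 2
rmse_from_sq = np.sqrt(sq_dist.mean())

print(mean_dist, rmse_from_dist, rmse_from_sq)
```

Both options give the same RMSE here; which one is preferable depends on
whether other callers rely on the non-squared distances.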

Ben Root