[SciPy-user] Kmeans help and C source

eric eric at scipy.org
Tue Jan 8 10:32:07 EST 2002


Hey Costas,

> Secondly, as I mentioned above, I need a modified kmeans.  However, I see no 
> C/Fortran code in the src tarball or CVS that seems related to kmeans.  Is 
> the base code available?  If so, is it hackable by a SWIG newbie? (I am 
> aware of SWIG, but I have never used it for anything serious).

The vq algorithms are all available in Python versions in scipy.cluster.vq.  The kmeans algorithm is written in pure Python, and should be easy to modify -- no SWIG necessary.  The time-consuming part of kmeans is the vector quantization (vq) algorithm.  It has both pure Python and C implementations.  If you're interested in the algorithm, study the Python version.  The C code does the same thing, but is less obvious.  The C version is available in a source distribution of scipy.  It lives in scipy/cluster/vq/src.  If you need to modify vq also, my advice is to do it in Python.  It will be easy to move to C later on.
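
To give you a feel for what you'd be modifying, here is a rough sketch of one
kmeans refinement step written against the public vq functions.  This is my
own simplified illustration, not the actual library source -- the helper name
kmeans_step is made up:

    # sketch of a single kmeans refinement step (illustration only)
    from scipy import *
    from scipy.cluster import vq

    def kmeans_step(obs, code_book):
        """ Assign each observation to its nearest code, then move each
            code to the mean of the observations assigned to it.
        """
        code_ids, distortion = vq.vq(obs, code_book)
        new_codes = []
        for i in range(len(code_book)):
            # observations currently assigned to code i
            members = compress(code_ids == i, obs, 0)
            if len(members):
                # average the member observations to get the new code
                new_codes.append(sum(members, 0) / float(len(members)))
            else:
                new_codes.append(code_book[i])   # keep empty codes unchanged
        return array(new_codes), sum(distortion) / float(len(distortion))

    # kmeans itself just repeats this step until the average distortion
    # stops improving by more than thresh (or the iterations run out).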

I'm working on a more detailed note about this stuff, and its beginnings are at the end of this note.  Here is an example to at least get you rolling:

--------------------------------------------------------
#ex1.py
from scipy import *
from scipy.cluster import vq

def cluster_data(data,cluster_cnt,iter=20,thresh=1e-5):
    """ Group data into a number of common clusters
    
        data -- 2D array of data points.  Each point is a row in the array.
        cluster_cnt -- The number of clusters to use
        iter -- number of iterations to use for kmeans algorithm
        thresh -- distortion threshold for kmeans algorithm

        return -- list of 2D arrays.  Each array contains the data points
                  that belong to a specific cluster.
                  
        Uses kmeans algorithm to find the clusters.                   
    """
    # scale each feature by its standard deviation before clustering
    wh_data = vq.whiten(data)
    # find cluster_cnt codes (cluster centers) in the whitened data
    code_book,dist = vq.kmeans(wh_data,cluster_cnt,iter,thresh)
    # assign each data point to its nearest code
    code_ids, distortion = vq.vq(wh_data,code_book)
    clusters = []
    for i in range(len(code_book)):
        # collect the original (un-whitened) rows assigned to code i
        cluster = compress(code_ids == i,data,0)
        clusters.append(cluster)
    return clusters

if __name__ == "__main__":
    
    data = array(((400, 79, 5.4),
             (180, 76, 4.5),
             (28,  25, 30.),
             (270, 81, 5.0),
             (185, 78, 4.6)))

    clusters = cluster_data(data,2)
    for i in range(len(clusters)):
        print 'cluster %d:' % i
        print clusters[i]

--------------------------------------------------------        
My (partial) more detailed response follows:

> --I confess to not understanding the docstring

Sorry for the confusion.  I'll try to explain things more clearly.  If it works,
I'll use this as the doc. :)

> However, I am not quite following the kmeans() functionality (I am new to 
> this clustering business, so this maybe a stupid newbie question): my docs 
> tell me that kmeans should partition a dataset into k clusters.  So, I 
> expect vq.kmeans(dataset, 2) to return to me dataset split up into two 
> "equivalent" datasets.

Splitting the data into two data sets is actually a two- or three-step process.
Here's a complete example.

"Observations" are just another name for a data point.  The obs matrix is a 2D 
array of data points.  For example if our data set includes height, weight, and 
40-yard speed of football players, you might have the following (fictitous) 
data:

obs:
                            lb        inches     seconds       
                           ----------------------------
    Refrigerator Perry    | 400         79         5.4
    Jerry Rice            | 180         76         4.5
    Zachary Jones         |  28         25        30.0
    Too Tall Jones        | 270         81         5.0
    Charlie Joiner        | 185         78         4.6

The data above is the "obs" matrix.  Each row in the 2D array is a data point, 
often called an "observation" or "sample".  Each column is sometimes called a 
"feature" of a player.  Imagine we want to split this data set into two 
"clusters", perhaps dividing the data into linemen and receivers.  One way is to 
find two "codes", one to represent each of these groups.  (I think the term 
"code" comes from communication theory, but I'm not sure.  Perhaps "classes" is 
more descriptive.)  I've watched enough games (observed enough players) to make 
an educated guess as to what these codes might be:

possible code book:

                code        lb         inches     seconds
                            -----------------------------
    receiver     0         | 180          75         4.8
    lineman      1         | 260          76         5.5


So code 0 stands for a "typical" receiver and code 1 represents your typical 
lineman.  "Vector quantization" is an algorithm that calculates the distance 
between a data point and every code in the "code book" to find the closest one 
(i.e. which class is the best match). In scipy.cluster, the vq module houses 
the vector quantization tools. vq.vq(obs,code_book) returns 2 arrays -- (1) the 
index (row in the code book) of the code nearest to each data point, and (2) 
the distance that each data point is from that code. code_book is always a 2D
array.  If obs is a 2D array, each row is a separate data point.

    # note I rounded some numbers to save space
    >>> obs = array(((400, 79, 5.4),
    ...              (180, 76, 4.5),
    ...              (28,  25, 30.),
    ...              (270, 81, 5.0),
    ...              (185, 78, 4.6)))
    >>> code_book = array(((180, 75, 4.8),
    ...                    (260, 76, 5.5)))
    >>> code_id, distortion = vq.vq(obs,code_book)
    >>> code_id
    array([1, 0, 0, 1, 0])
    >>> distortion
    array([ 140.03, 1.045, 161.985, 11.192, 5.834]) 

code_id now tells us what position each of the football players is most likely 
to play.  distortion is the distance, sqrt(a^2 + b^2 + c^2) where a, b, and c 
are the differences in weight, height, and speed, from each player to the code 
(typical player) in their category.  Low numbers mean the match is good.  For 
example, vq tells us that Jerry Rice is a receiver, and the low distortion means 
that it is pretty dang sure about that.

                             code_id          distortion
                         ------------------------------ 
    Refrigerator Perry     1 --> lineman         140.03      
    Jerry Rice             0 --> receiver          1.04
    Zachary Jones          0 --> receiver        161.99
    Too Tall Jones         1 --> lineman          11.19
    Charlie Joiner         0 --> receiver          5.83
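
As a sanity check, you can reproduce Jerry Rice's distortion by hand from the
feature differences.  This is my own arithmetic, not library output, and it
agrees with the 1.045 above except for the rounding I mentioned earlier:

    # Jerry Rice (180, 76, 4.5) vs. the "receiver" code (180, 75, 4.8)
    # (output rounded)
    >>> sqrt((180-180)**2 + (76-75)**2 + (4.5-4.8)**2)
    1.044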

Most of the classifications make sense, but the distortions have some problems. 
Notably, my 1-year-old son is about as likely to be a receiver as R. Perry is to 
be a lineman.  Looking at the data, it is obvious that R. Perry is a lineman.  
It isn't obvious where Zach falls because he's small (like a receiver) and slow 
(like a lineman).  So we should be quite a bit more sure about the fact that 
R. Perry is a lineman.  What's wrong?  Well, the distortion value's primary 
contribution comes from the large weight differences between these players and 
the typical players.  Even though Zach's speed is a long way from a receiver's 
speed, this doesn't contribute much to the distortion.  That's bad because the 
speed difference carries a lot of useful information.  For more accurate 
distortion values, we would like each feature of a player (weight, height, and 
speed) to be weighted equally.  One way of doing this is to "whiten" the 
features of both the data and the code book by dividing each feature by the 
standard deviation of that feature across the data set.

The standard deviations of the weights and speeds across all the players are:

    #weight
    >>> stats.stdev((400,180,28,270,185))
    136.30407183939883
    #speed
    >>> stats.stdev((5.4,4.5,30.0,5.0,4.6))
    11.241885962773329

So the whitened weight and speed distortions for R. Perry from a typical 
lineman are:

    # whitened weight
    >>> abs(400-260)/136.3
    1.0271460014673512   
    # whitened speed
    >>> abs(5.4-5.5)/11.24
    0.0088967971530248789

So the whitened weight and speed distortions for Z. Jones from a typical 
receiver are:

    # whitened weight
    >>> abs(28-180)/136.3
    1.1151870873074101
    # whitened speed
    >>> (30.0-4.8)/11.24
    2.2419928825622777

It is apparent from these values that Zach's speed difference is going to have
a large effect on the distortion now.  

The whiten() function handles the chore of whitening a data set.

    >>> wh_obs = vq.whiten(obs)

Usually, the code book is actually calculated from a whitened set of data (via 
kmeans or some other method), and thus is already whitened.  In this case, I 
specified the code_book in non-whitened units, so it needs to be normalized by 
the same standard deviations as the data set.

    # not normally needed
    >>> wh_code_book = code_book / std(obs,axis=0)

Now, rerunning vq gives:
    
    >>> code_id, distortion = vq.vq(wh_obs,wh_code_book)
    >>> code_id
    array([1, 0, 0, 1, 0])
    >>> distortion
    array([ 1.034, 0.049,  3.257 ,  0.225,  0.131])
    
                             code_id         whitened distortion
                         --------------------------------------- 
    Refrigerator Perry     1 --> lineman             1.034      
    Jerry Rice             0 --> receiver            0.049
    Zachary Jones          0 --> receiver            3.257
    Too Tall Jones         1 --> lineman             0.225
    Charlie Joiner         0 --> receiver            0.131

Now Zach's distortion is much higher than everyone else's, which makes sense.
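
If you want to check that 3.257 by hand, it is just the same sqrt-of-squares
distance computed on the whitened feature differences.  The height standard
deviation wasn't shown above, so I compute it here too.  Again, this is my own
arithmetic rather than library output, with the values rounded to save space:

    # standard deviation of the heights
    >>> stats.stdev((79,76,25,81,78))
    23.994
    # whitened distance of Z. Jones from the "receiver" code
    >>> sqrt(((28-180)/136.3)**2 + ((25-75)/23.99)**2 + ((30.0-4.8)/11.24)**2)
    3.258

Up to rounding, this agrees with the 3.257 reported by vq above.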

In the example above, I made an educated guess, based on my knowledge of 
football, of what the size and speed of a typical player in a given position 
might be.  This is information I had to supply to the clustering algorithm
before it could determine how to classify each player.  Suppose you didn't
know anything about football, and watched a football game but were given a 
list of players with their position, weight, height, and speed.  You could 
use this information to educate yourself about the traits of a typical 
receiver, linebacker, lineman, etc.  The kmeans algorithm also does this.  
It takes a set of data and the number of positions you want to categorize, 
and spits out an appropriate code book.
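
Continuing the session above, that last step looks roughly like this (a sketch
only -- kmeans starts from a random guess, so the exact codes and distortion
will differ from run to run):

    # learn a 2-code code book from the whitened observations
    >>> learned_code_book, avg_distortion = vq.kmeans(wh_obs, 2)
    # then classify the players against the learned codes, as before
    >>> code_id, distortion = vq.vq(wh_obs, learned_code_book)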

<continued soon>

----- Original Message ----- 
From: "Costas Malamas" <costasm at hotmail.com>
To: <scipy-user at scipy.net>
Sent: Monday, January 07, 2002 5:50 PM
Subject: [SciPy-user] Kmeans help and C source


> Hi all,
> 
> I need to use a modified K-means algorithm for a project and I was delighted 
> to discover that SciPy includes a python wrapper for a kmeans() function.
> 
> However, I am not quite following the kmeans() functionality (I am new to 
> this clustering business, so this maybe a stupid newbie question): my docs 
> tell me that kmeans should partition a dataset into k clusters.  So, I 
> expect vq.kmeans(dataset, 2) to return to me dataset split up into two 
> "equivalent" datasets.  However, anyway I feed my data into vq.kmeans() this 
> doesn't happen (e.g. I feed it a 5x4 dataset and I get back two 5x1 
> vectors).  
> My guess is that either this vq.kmeans() does something different 
> --I confess to not understanding the docstring as the observation/codebook 
> terminology has no parallel to the docs I've read-- or that I am not doing 
> something right.  Any pointers? Even some documentation on the algorithm 
> would be great help.
> 
> Secondly, as I mentioned above, I need a modified kmeans.  However, I see no 
> C/Fortran code in the src tarball or CVS that seems related to kmeans.  Is 
> the base code available?  If so, is it hackable by a SWIG newbie? (I am 
> aware of SWIG, but I have never used it for anything serious).
> 
> Any and all info will be greatly appreciated :-) --and thanks for SciPy!
> 
> 
> Costas Malamas
> 
