[Numpy-discussion] (K-Mean) Clustering

Janne Sinkkonen Janne.Sinkkonen at hut.fi
Tue Jun 25 05:04:04 EDT 2002


> Using argmin it should be relatively easy to assign each vector to the
> cluster with the closest representative (using sum((x-y)**2) as the
> distance measure), but how do I calculate the new representatives
> efficiently? (The representative of a cluster, say cluster 10, should be
> the average of all vectors currently assigned to that cluster.) I could
> always use a loop and then compress() the data based on cluster
> number, but I'm looking for a way of calculating all the averages
> "simultaneously", to avoid using a Python loop... I'm sure there's a
> simple solution -- I just haven't been able to think of it yet. Any
> ideas?
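
The trick is to turn the per-cluster averaging into a single matrix product
with a 0/1 membership matrix, so no Python loop over the clusters is needed.
A minimal standalone sketch of just that step (a hypothetical cluster_means
helper, not part of the code below, untested):

from Numeric import *

def cluster_means(data, centers):
    # All pairwise squared distances between data rows and centers, (N, M).
    d = (add.outer(sum(data*data, -1), sum(centers*centers, -1))
         - 2*dot(data, transpose(centers)))
    # Membership matrix: row i has a single 1 in the column of its
    # closest center.
    x = equal.outer(argmin(d), arange(centers.shape[0])).astype(Float32)
    # One matrix product gives the per-cluster sums; dividing by the
    # column sums (the cluster sizes) turns them into means.  Assumes
    # no cluster is empty.
    return dot(transpose(x), data) / sum(x)[..., NewAxis]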

Maybe this helps (old code, may contain some suboptimal or otherwise
weird things):

from Numeric import *
from RandomArray import randint
import sys

def squared_distances(X,Y):
    # All pairwise squared Euclidean distances between the rows of X and
    # the rows of Y, using ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y
    return add.outer(sum(X*X,-1), sum(Y*Y,-1)) - 2*dot(X,transpose(Y))

def kmeans(data, M,
           wegstein=0.2,
           r_convergence=0.001,
           epsilon=0.001, debug=0, minit=20):
    """Computes k-means centers for DATA with M clusters, iterating until
the relative change of the quantization error drops below R_CONVERGENCE.
WEGSTEIN, by default 0.2 and always between 0 and 1, damps the center
updates to stabilize convergence.
EPSILON is used to guarantee that the initial centers are all different.
DEBUG causes some intermediate output to appear on stderr.
MINIT is the minimum number of iterations.
Returns the centers and the average (squared) quantization error.
"""
    N, D = data.shape

    # Selecting the initial centers has to be done carefully: we have to
    # ensure they are all different, otherwise the algorithm below will
    # produce empty classes.  (Simply taking
    # centers = take(data, randint(0, N, (M,))) could pick duplicates.)
    centers = []
    if debug:
        sys.stderr.write("kmeans: Picking centers.\n")
    while len(centers) < M:
        # Pick one data item at random and accept it only if it is not
        # (nearly) identical to an already chosen center.
        candidate = data[randint(N)]
        if len(centers) > 0:
            d = minimum.reduce(squared_distances(array(centers), candidate))
        else:
            d = 2 * epsilon
        if d > epsilon:
            centers.append(candidate)

    if debug:
        sys.stderr.write("kmeans: Iterating.\n")
    centers = array(centers)
    qerror, old_qerror, counter = None, None, 0
    while (counter < minit
           or old_qerror is None
           or (old_qerror - qerror) / qerror > r_convergence):
        # Squared distances from data to centers (all pairs), shape (N, M).
        distances = squared_distances(data, centers)
        # 0/1 matrix telling which data item is closest to which center.
        x = equal.outer(argmin(distances),
                        arange(centers.shape[0])).astype(Float32)
        # New centers: per-cluster means, damped by the Wegstein factor.
        centers = (wegstein * (dot(transpose(x), data) / sum(x)[..., NewAxis])
                   + (1.0 - wegstein) * centers)
        # Average squared quantization error over the data.
        old_qerror = qerror
        qerror = sum(minimum.reduce(distances, 1)) / N
        counter = counter + 1
        if debug:
            try:
                sys.stderr.write("%f %f %i\n" % (qerror, old_qerror, counter))
            except TypeError:
                sys.stderr.write("%f None %i\n" % (qerror, counter))
    return centers, qerror
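
A hypothetical usage sketch (not from the original post), assuming
RandomArray's uniform random() just to get some throwaway 2-D data:

from RandomArray import random

data = random((1000, 2))                    # 1000 random points in the unit square
centers, qerror = kmeans(data, 5, debug=1)  # 5 clusters, progress on stderr
print centers
print qerror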

-- 
Janne