I've been looking for an implementation of k-means clustering in Python, but haven't found anything I can use. I believe there is one in SciPy, but I'd rather keep the number of required packages as low as possible (I'm already using Numeric/numarray), and Orange seems a bit hard to install on UNIX. So I've been experimenting with Numeric/numarray for the purpose. Has anyone else done something like this (or implemented some other clustering algorithm, for that matter)?
The approach I've been using (though it isn't finished yet) is to store the data in a two-dimensional multiarray (i.e. a "set" of vectors), together with a one-dimensional array holding the cluster assignment of each vector. E.g.
array([1, 2, 3, 4, 5])
array([1, 2, 4, 5, 4])
Here reps holds the representative of each cluster.
Using argmin, it should be relatively easy to assign each vector to the cluster with the closest representative (using sum((x-y)**2) as the distance measure), but how do I calculate the new representatives efficiently? (The representative of a cluster, say cluster 10, should be the average of all vectors currently assigned to that cluster.) I could always loop over the clusters and compress() the data based on cluster number, but I'm looking for a way of calculating all the averages "simultaneously", to avoid a Python loop. I'm sure there's a simple solution; I just haven't been able to think of it yet. Any ideas?
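For what it's worth, here is a minimal sketch of one way both steps might be done without a Python loop. It is written against NumPy (an assumption on my part; it is not Numeric/numarray, though comparison, dot and argmin should have close counterparts there, which I have not checked). The function names assign_clusters and update_reps are made up for illustration. The idea for the averages is to build a 0/1 membership matrix with one column per cluster, so that a single matrix product gives all the per-cluster sums at once; dividing by the column counts then gives the averages. Clusters that end up empty simply keep their old representative, which is one possible convention among several.

```python
import numpy as np

def assign_clusters(data, reps):
    """Return, for each row of data, the index of the closest row of reps.

    data has shape (n, d) and reps has shape (k, d); the distance
    measure is the squared Euclidean distance sum((x - y)**2).
    """
    # Broadcasting gives every vector-minus-representative pair: shape (n, k, d).
    diff = data[:, np.newaxis, :] - reps[np.newaxis, :, :]
    dist2 = (diff ** 2).sum(axis=2)          # shape (n, k)
    return dist2.argmin(axis=1)              # shape (n,)

def update_reps(data, labels, reps):
    """Return new representatives: the mean of the vectors in each cluster."""
    k = reps.shape[0]
    # membership[i, j] is 1.0 if vector i belongs to cluster j, else 0.0.
    membership = (labels[:, np.newaxis] == np.arange(k)).astype(float)
    sums = np.dot(membership.T, data)        # per-cluster sums, shape (k, d)
    counts = membership.sum(axis=0)          # per-cluster sizes, shape (k,)
    new_reps = reps.astype(float)            # astype returns a copy
    nonempty = counts > 0
    # Empty clusters keep their old representative (an arbitrary choice).
    new_reps[nonempty] = sums[nonempty] / counts[nonempty, np.newaxis]
    return new_reps

# Made-up toy data: two obvious groups of two vectors each.
data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
reps = np.array([[0.0, 0.0], [10.0, 10.0]])

labels = assign_clusters(data, reps)         # array([0, 0, 1, 1])
new_reps = update_reps(data, labels, reps)   # [[0.0, 0.5], [10.0, 10.5]]
```

Repeating these two calls until labels stops changing would give the usual k-means iteration. Note that the broadcast in assign_clusters and the membership matrix both trade memory for speed (n*k*d and n*k elements respectively), so for very large data sets the compress()-based loop over clusters may still be the more practical choice.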