[SciPy-User] Kmeans, the correct role of whitening?
Luca Giacomel
luca.giacomel at gmail.com
Fri May 6 04:03:47 EDT 2011
Hello,
I'm developing a piece of code that uses the kmeans algorithm to cluster large amounts of financial data, and I have a few problems with the whitening function: how and when should I use it? To keep it simple, my initial idea was this:
# Plain Django code for retrieving the values from the db.
whitened_points = whiten(array([point.getDimensionalFields().values() for point in Point.objects.all()]))
# kmeans returns the codebook (the centroids) and the distortion.
codebook, distortion = kmeans(whitened_points, number_of_clusters)
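For reference, here is a minimal sketch of what the whitening step does to the data before clustering. The random array is a hypothetical stand-in for the database rows, and the cluster count of 4 is arbitrary; `whiten` divides each column by that column's standard deviation, which the sketch checks by hand.

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans

rng = np.random.default_rng(0)
# Hypothetical stand-in for the db rows: 200 points, 3 features on very different scales.
points = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])

# whiten() rescales every column (feature) to unit variance.
whitened_points = whiten(points)

# The same result computed by hand: divide each column by its standard deviation.
column_std = points.std(axis=0)
same_as_manual = np.allclose(whitened_points, points / column_std)

# Cluster the rescaled data; kmeans returns the centroids and the mean distortion.
codebook, distortion = kmeans(whitened_points, 4)
```

Note that the centroids in `codebook` live in the rescaled space, not in the original units.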
And now I'm happy because scipy generated some wonderful clusters for me. The next step is to take another observation and use vector quantization to check which cluster it belongs to. But how should I use the whitening function to get unit variance for that observation? I can think of three approaches; which one is correct?
1
# Take the observation and add it to all the points, so the variance is calculated correctly.
# Problem: this uses A LOT of memory for every single check, which doesn't seem right.
# Append the obs to all the other values (assuming obs is a flat list of values); terrible approach.
whitened_points = whiten(array([point.getDimensionalFields().values() for point in Point.objects.all()] + [obs]))
# Use [-1:] to get the normalized obs as a 2-D row, which is what vq expects.
code, distance = vq(whitened_points[-1:], codebook)
2
# Take the observation and add it to all the centroids.
# Problem: does this make any sense?
# Append the obs to the centroids (assuming obs is a flat list of values); uses far less RAM as I only have 50 clusters.
whitened_points = whiten(array([cluster.centroid.getDimensionalFields().values() for cluster in Cluster.objects.all()] + [obs]))
# Use [-1:] to get the normalized obs as a 2-D row, which is what vq expects.
code, distance = vq(whitened_points[-1:], codebook)
3
# Whiten the observation with itself.
# Problem: does this make any sense? (again)
whitened_points = whiten(array(obs))  # fast :)
# No -1 index needed here: whitened_points holds only the observation.
code, distance = vq(whitened_points, codebook)
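For comparison with the three approaches above, here is a sketch of one other option, offered as an assumption rather than a confirmed answer: keep the per-feature standard deviations of the original points and divide each new observation by them, so it ends up on the same scale as the codebook. The names `training_points`, `training_std` and `scaled_obs`, the random data, and the cluster count of 4 are all hypothetical.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(0)
# Hypothetical stand-in for the db rows: 200 points, 3 features on very different scales.
training_points = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])

# Compute the scaling once and keep it; this is the factor whiten() would divide by.
training_std = training_points.std(axis=0)
training_std[training_std == 0] = 1.0  # avoid dividing by zero for constant features

# Cluster the rescaled training data.
codebook, distortion = kmeans(training_points / training_std, 4)

# A new observation is scaled with the stored factors, not whitened on its own.
obs = np.array([0.5, -20.0, 300.0])
scaled_obs = np.atleast_2d(obs) / training_std  # vq expects a 2-D array of observations

# code[0] is the index of the nearest centroid, distance[0] its distance to it.
code, distance = vq(scaled_obs, codebook)
```

With this approach only one small array (one standard deviation per feature) has to be stored, so nothing needs to be recomputed for each check.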
Thanks in advance!