List behaviours with Clustering Algorithm
Peter Otten
__peter__ at web.de
Wed Jan 26 04:43:06 EST 2011
James Ravenscroft wrote:
>> > I can't run your code because you didn't make it standalone,
> Thanks for the heads up, I've made a simple version of the clusterer
> which you can view on pastebin: http://pastebin.com/7HmAkmfj If you have
> time to look through my code I would be very grateful!
>
>
>
>> > but in your case that is probably not enough.
>> > Try something along these lines:
>> >
>> > # untested
>> > while len(self.clusters) > 1:
>> > c = self.clusters.pop()
>> > # find closest index
>> > for i, c2 in enumerate(self.clusters):
>> > ...
>> > if ...:
>> > closest_index = i
>> > closest = self.clusters.pop(closest_index)
>> > tmp.append(c + closest)
>> > if self.clusters:
>> > tmp.append(self.clusters[0])
> I had a go at implementing something along the lines above and I'm still
> getting very bizarre results. There does appear to be some sort of logic
> to it though, if you look at the graph chart, you can see that It seems
> to be doing the clustering right and then forgetting to remove the old
> groupings providing this "cloning" effect for some cluster groups.
I'm sorry I can't infer the intended algorithm from the code you provide.
Perhaps you can give a short description in plain English?
However, here's the implementation of the algorithm you mention as described
on wikipedia:
http://en.wikipedia.org/wiki/Cluster_analysis#Agglomerative_hierarchical_clustering
Repeatedly merge the two closest clusters until there's only one left.
To keep track of the two merged clusters I've added a "children" key to your
node dictionaries. The intermediate states of the tree are also put into the
levels variable (I suppose that's the purpose of your levels attribute).
The main difference to your code is that
(1) list modifications occur only in the outer while loop, so both inner
loops can become for-loops again.
(2) test all pairs before choosing the pair
debug = True
def cluster(self):
'''Run the clustering operation on the files
'''
clusters = []
#add a cluster for each track to clusters
for song in self.music:
clusters.append({'data' : [song]})
levels = []
while len(clusters) > 1:
# calculate weights
for c in clusters:
c["weight"] = self.__getClusterWeight(c['data'])
# find closest pair
closestDelta = float('inf')
closestPair = None
for i, a in enumerate(clusters):
for k, b in enumerate(clusters):
if i == k:
break
delta = abs(a['weight'] - b['weight'])
if delta < closestDelta:
closestPair = i, k
closestDelta = delta
# merge closest pair
hi, lo = closestPair
left = clusters[lo]
right = clusters.pop(hi)
clusters[lo] = {"data": left["data"] + right["data"],
"children": [left, right]}
# keep track of the intermediate tree states
levels.append(list(clusters))
if self.debug:
print ("stage %d" % len(levels)).center(40, "-")
for c in clusters:
print c["data"]
# store tree
self.clusters = clusters
self.levels = levels
Note that there's some potential for optimisation. For example, you could
move the weight calculation out of the while-loop and just calculate the
weight for the newly merged node.
Peter
More information about the Python-list
mailing list