List behaviours with Clustering Algorithm
Peter Otten
__peter__ at web.de
Mon Jan 24 06:43:34 EST 2011
James Ravenscroft wrote:
> Dear All,
>
> I am currently trying to write a simple Agglomerative Clustering
> algorithm which sorts through my MP3 collection and uses associated
> Last.FM tags to cluster files into 'genres'. Unfortunately, I'm having
> some trouble with my algorithm and some tracks are ending up in multiple
> clusters. I have a feeling this is because of (my lack of understanding
> of) list behaviours within Python. This is all purely a learning
> exercise, so any input would be greatly appreciated
>
> The actual clustering method can be seen described here. Each Cluster is
> stored within the 'clusters' array and has two elements: the data
> containing song metadata such as artist, title and most importantly
> tags, and a weight: that is, the overall Euclidean weight of the cluster
> (based on the song data within).
>
> In theory, the algorithm runs in a loop until all clusters have been
> merged into a hierarchy. It takes the weight of each cluster and merges
> each cluster with their closest counterpart. My code has a lot of
> comments so it should be fairly easy to understand.
> tmp = []
>
> for c in self.clusters:
>
> closestCluster = None
> closestDelta = float('inf')
>
> for c2 in self.clusters:
>
> #skip if this is the same node
> if(c == c2):
> continue
>
> delta = abs(c2['weight'] - c['weight'])
>
> if(delta < closestDelta):
> closestCluster = c2
> closestDelta = delta
>
>
> print "Merging clusters %(c1)d and %(c2)d" % {'c1' :
> self.clusters.index(c),
> 'c2' :
> self.clusters.index(closestCluster)}
> #now merge the two clusters
> self.clusters.remove(closestCluster)
> self.clusters.remove(c)
>
> c['data'].extend(closestCluster['data'])
>
> tmp.append(c)
I can't run your code because you didn't make it standalone, but I believe
that the problem (at least one of them) is that you iterate over
self.clusters and modify it from within the loop. That's a big no-no in
python. A simple example to demonstrate the effects:
>>> import random
>>> old = range(10)
>>> new = []
>>> for item in old:
... closest = random.choice(old)
... new.append((item, closest))
... old.remove(item)
... old.remove(closest)
...
>>> old
[3, 4]
>>> new
[(0, 8), (2, 1), (5, 7), (9, 6)]
The remedy is normally to iterate over a copy
for item in list(old):
...
but in your case that is probably not enough.
Try something along these lines:
# untested
while len(self.clusters) > 1:
c = self.clusters.pop()
# find closest index
for i, c2 in enumerate(self.clusters):
...
if ...:
closest_index = i
closest = self.clusters.pop(closest_index)
tmp.append(c + closest)
if self.clusters:
tmp.append(self.clusters[0])
Peter
More information about the Python-list
mailing list