[Numpy-discussion] Faster
Keith Goodman
kwgoodman at gmail.com
Sun May 4 11:46:11 EDT 2008
On Sun, May 4, 2008 at 8:14 AM, Hoyt Koepke <hoytak at gmail.com> wrote:
> > and then update the others only if the row you're updating has that
> > minimum value in it. Then, when scanning for the min dist, you only
> > need to scan O(n) rows.
>
> Sorry, let me clarify -- Update the entries corresponding to entries
> in the row you're updating if they are the same as the minimum
> distance; in that case you need to rescan the row. Also, ySorry, ou
> only need to scan O(n) entries in your cached array.
If I understand your improvement, I can speed up dist.min() in
i, j = np.where(dist == dist.min())
return i[0], j[0]
since it is faster to find the min in a n-element array than in a nxn
array. But the new code, thanks to Robert, is
ij = x.argmin()
i, j = divmod(ij, N)
Would a 1d array of the column minimums still help?
On a separate note, the distance matrix is symmetric so I could fill
the lower half of the distance matrix with large values and, after
updating a column skip the step of keeping the matrix symmetric by
copying the column into the row (d[i,:] = d[:,i]).
Does d[i,:] = d[:,i] make a copy?
Here's what I have so far. It's fast for my needs.
import time
import numpy as np
class Cluster:
"Single linkage hierarchical clustering"
def __init__(self, dist, label=None, linkage='single', debug=False):
"""
dist Distance matrix, NxN numpy array
label Labels of each row of the distance matrix, list of N items,
default is range(N)
"""
assert dist.shape[0] == dist.shape[1], 'dist must be square (nxn)'
assert (np.abs(dist - dist.T) < 1e-8).all(), 'dist must be symmetric'
if label is None:
label = range(dist.shape[0])
assert dist.shape[0] == len(label), 'dist and label must match in size'
msg = 'linkage must be single or complete'
assert linkage in ('single', 'complete'), msg
self.c = [[[z] for z in label]]
self.label = label
self.linkage = linkage
self.dist = dist
self.debug = debug
def run(self):
for level in xrange(len(self.label) - 1):
i, j = self.min_dist()
self.join(i, j)
def join(self, i, j):
assert i != j, 'Cannot combine a cluster with itself'
# Join labels
new = list(self.c[-1])
new[i] = new[i] + new[j]
new.pop(j)
self.c.append(new)
# Join distance matrix
if self.linkage == 'single':
self.dist[:,i] = self.dist[:,[i,j]].min(1)
elif self.linkage == 'complete':
self.dist[:,i] = self.dist[:,[i,j]].max(1)
else:
raise NotImplementedError, 'Unknown linkage method'
self.dist[i,:] = self.dist[:,i]
# A faster verion of this code...
# idx = range(self.dist.shape[1])
# idx.remove(j)
# self.dist = self.dist[:,idx]
# self.dist = self.dist[idx,:]
# ...is this...
out = self.dist[:-1,:-1]
out[:i,i:] = self.dist[:i,i+1:]
out[i:,:i] = self.dist[i+1:,:i]
out[i:,i:] = self.dist[i+1:,i+1:]
self.dist = out
# Debug output
if self.debug:
print
print len(self.c) - 1
print 'Clusters'
print self.c[-1]
print 'Distance'
print self.dist
def min_dist(self):
# A faster version of this code...
# dist = self.dist + 1e10 * np.eye(self.dist.shape[0])
# i, j = np.where(dist == dist.min())
# return i[0], j[0]
# ...is this:
x = self.dist
N = x.shape[0]
# With complete linkage the min distance was sometimes on the diagonal
# I think it occured on the last merge (one cluster). So I added + 1.
x.flat[::N+1] = x.max() + 1
ij = x.argmin()
i, j = divmod(ij, N)
return i, j
def test():
# Example from
# home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html
label = ['BA', 'FI', 'MI', 'NA', 'RM', 'TO']
dist = np.array([[0, 662, 877, 255, 412, 996],
[662, 0, 295, 468, 268, 400],
[877, 295, 0, 754, 564, 138],
[255, 468, 754, 0, 219, 869],
[412, 268, 564, 219, 0, 669],
[996, 400, 138, 869, 669, 0 ]])
clust = Cluster(dist, label, linkage='single', debug=True)
clust.run()
def test2(n):
x = np.random.rand(n,n)
x = x + x.T
c = Cluster(x)
t1 = time.time()
c.run()
t2 = time.time()
print 'n = %d took %0.2f seconds' % (n, t2-t1)
More information about the NumPy-Discussion
mailing list