[Numpy-discussion] Efficient square distance computation
sunk.cs at gmail.com
Tue Oct 8 04:06:25 EDT 2013
I have written the following function to compute the square distances of a large
matrix (each sample a row). It compute row by row and print the overall progress.
The progress output is important and I didn't use matrix multiplication.
I give as input a 70,000x800 matrix. The output should be a 70,000x70,000
matrix. The program runs really slow (16 hours for 1/3 progress). And it eats
36G memory (fortunately I have enough).
Could you give some insights on how to modify the code to be efficient and
to eat less memory?
def dist2_large( data ):
if data.ndim != 2: raise RuntimeError( "data should be a matrix" )
N,D = data.shape
print 'using the sample-wise implementation'
print '%d samples, %d dimensions' % (N,D)
start_t = time.time()
d2 = np.zeros( [N,N] )
for i in range( N ):
print "\r%5d/%d" % (i+1, N),
for j in range( N ):
d2[i,j] = ((data[i] - data[j])**2).sum()
total_t = time.time() - start_t
hours = (total_t / 3600)
minutes = (total_t % 3600) / 60
print "\nfinished in %2dh%2dm" % (hours, minutes)
More information about the NumPy-Discussion