[Numpy-discussion] Efficient square distance computation

Ke Sun sunk.cs at gmail.com
Tue Oct 8 04:06:25 EDT 2013

Dear all,

I have written the following function to compute the square distances of a large
matrix (each sample a row). It compute row by row and print the overall progress.
The progress output is important and I didn't use matrix multiplication.

I give as input a 70,000x800 matrix. The output should be a 70,000x70,000
matrix. The program runs really slow (16 hours for 1/3 progress). And it eats
36G memory (fortunately I have enough). 

Could you give some insights on how to modify the code to be efficient and
to eat less memory?

Ke Sun

def dist2_large( data ):
    import time
    if data.ndim != 2: raise RuntimeError( "data should be a matrix" )
    N,D = data.shape

    print 'using the sample-wise implementation'
    print '%d samples, %d dimensions' % (N,D)

    start_t = time.time()
    d2 = np.zeros( [N,N] )
    for i in range( N ):
        print "\r%5d/%d" % (i+1, N), 
        for j in range( N ):
            d2[i,j] = ((data[i] - data[j])**2).sum()

    total_t = time.time() - start_t
    hours = (total_t / 3600)
    minutes = (total_t % 3600) / 60
    print "\nfinished in %2dh%2dm" % (hours, minutes)

    return d2

More information about the NumPy-Discussion mailing list