Why is hamming distance so slow?
Hi, I am trying to calculate the pairwise hamming distances among a large set of boolean arrays. I would think that hamming distance would be one of the fastest distance metrics, but it seems to be much slower than alternatives. For instance I generate a 1000x1000 test boolean matrix| test = rng.rand(1000,1000) > 0.5 |Then I try to calculate the pairwise distances, e.g.| pdist(test, 'hamming') | I can run this in an average of 3.08s on my laptop, Whereas if I change /hamming/ to /sqeuclidean/ this goes down to 0.56s. These will effectively give me the same result (with the normalization factor), but I am quite surprised to see /hamming/ be so slow. In fact, computing /sqeuclidean/ distance on a real valued matrix of the same shape also runs in less than 0.6s Why should /hamming/, which is just doing bitwise /xor/ take so much longer than computing /sqeuclidean/ distances on real valued arrays? And, is there a recommended way to optimize this? Cheers, Josh
participants (1)
-
Joshua Auerbach