a program to delete duplicate files

Sat Mar 12 11:58:15 EST 2005

François Pinard wrote:

> Identical hashes for different files?  The probability of this happening
> should be extremely small, or else, your hash function is not a good one.

We're talking about MD5, SHA-1 or similar. None of them is
collision-free: two different files can hash to the same value. I agree
it's a rare case, but still, why settle for "about right" when you can
have "right"?

> I once was over-cautious about relying on hashes only, without actually
> comparing files.  A friend convinced me, doing maths, that with a good
> hash function, the probability of a false match was much, much smaller
> than the probability of my computer returning the wrong answer, despite
> thorough comparisons, due to some electronic glitch or cosmic ray.  So,
> my cautious attitude was by far, for all practical means, a waste.
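
For reference, the kind of arithmetic the quoted argument relies on can
be sketched as below. This is only an illustration, not the friend's
actual calculation: the function name collision_probability is made up
here, and the bound assumes the hash behaves like a uniformly random
function over files that nobody crafted on purpose to collide.

```python
def collision_probability(n_files, hash_bits):
    """Upper bound on the chance that any two of n_files share a hash.

    Assumption (not from the original post): every file gets an
    independent, uniformly random hash_bits-bit value. Each pair then
    collides with probability 2**-hash_bits, and summing over all pairs
    gives an upper bound on the chance of at least one collision.
    """
    pairs = n_files * (n_files - 1) // 2
    return pairs / 2 ** hash_bits


if __name__ == "__main__":
    # One million files: 128-bit (MD5-sized) and 160-bit (SHA-1-sized) hashes.
    for bits in (128, 160):
        print(bits, collision_probability(10 ** 6, bits))
```

The bound says nothing about deliberately constructed collisions, which
are a separate question from accidental ones.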

It was not my only argument for not using hashes. My algorithm also does 
fewer reads, for example.
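
To illustrate how direct comparison can avoid reads, here is a minimal
sketch. It is not the algorithm discussed in this thread; the names
same_content and find_duplicates and the 64 KiB block size are
assumptions made for the example. Files with a unique size are never
opened, and a comparison stops at the first differing block, whereas a
hash has to read every candidate file to the end.

```python
import os
from collections import defaultdict


def same_content(path_a, path_b, block_size=64 * 1024):
    """Return True if both files hold identical bytes.

    Reading stops at the first block that differs, so files that
    diverge early cost only a few reads.
    """
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files ended at the same point
                return True


def find_duplicates(paths):
    """Group paths whose contents are byte-for-byte identical."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for candidates in by_size.values():
        if len(candidates) < 2:
            continue  # unique size: cannot be a duplicate, never opened
        classes = []  # each class is a list of identical files
        for path in candidates:
            for cls in classes:
                if same_content(cls[0], path):
                    cls.append(path)
                    break
            else:
                classes.append([path])
        groups.extend(cls for cls in classes if len(cls) > 1)
    return groups
```

Note that this pairwise version can reread a file when many distinct
files share one size; comparing all same-sized candidates block by
block in parallel would avoid that, at the cost of more code.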

-pu