a program to delete duplicate files

François Pinard pinard at iro.umontreal.ca
Sat Mar 12 16:57:59 CET 2005

[Patrick Useldinger]

> Shouldn't you add the additional comparison time that has to be done
> after hash calculation? Hashes do not give 100% guarantee. If there's
> a large number of identical hashes, you'd still need to read all of
> these files to make sure.

Identical hashes for different files?  The probability of this happening
should be extremely small; otherwise, your hash function is not a good one.
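
For concreteness, here is a minimal sketch of the hash-only approach being
discussed. It is an illustration, not the program from this thread: the
function name find_duplicates, the choice of SHA-256, and the 1 MiB read
size are all assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicates(root):
    """Group regular files under root by SHA-256 digest (hypothetical helper).

    Returns a list of groups; each group is a sorted list of paths whose
    contents hash to the same digest. No byte-by-byte comparison is done.
    """
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            by_digest[digest.hexdigest()].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Every file is read exactly once here. Adding a final comparison with
filecmp.cmp(a, b, shallow=False) inside each group would be the extra
verification step the quoted message asks about.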

I once was over-cautious about relying on hashes only, without actually
comparing files.  A friend convinced me, doing maths, that with a good
hash function, the probability of a false match was much, much smaller
than the probability of my computer returning the wrong answer, despite
thorough comparisons, due to some electronic glitch or cosmic ray.  So
my cautious attitude was, for all practical purposes, a waste.
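
The kind of estimate involved can be sketched with the usual birthday
approximation. This is not my friend's actual calculation, only an
illustration; it assumes an ideal hash whose outputs are uniformly
distributed, and it covers accidental collisions only, not files crafted
on purpose to collide. The function name collision_probability is made
up for the example.

```python
import math


def collision_probability(n_files, hash_bits):
    """Approximate chance that any two of n_files distinct files collide.

    Birthday approximation for an ideal hash_bits-bit hash:
        p ~= 1 - exp(-n * (n - 1) / 2 / 2**hash_bits)
    """
    pairs = n_files * (n_files - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


for name, bits in [("MD5", 128), ("SHA-1", 160), ("SHA-256", 256)]:
    print(f"{name:8s} {collision_probability(10**6, bits):.3g}")
```

For a million distinct files and a 128-bit hash, this gives a figure on
the order of 1e-27.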

Similar arguments apply, say, to the confidence we may have in
probabilistic algorithms for testing whether a number is prime.
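
As an example of such an algorithm, here is a minimal sketch of the
Miller-Rabin test. The function name is_probable_prime and the default
of 20 rounds are choices made for this illustration. A composite number
survives a single round with probability at most 1/4, so it survives all
rounds with probability at most 4**-rounds.

```python
import random


def is_probable_prime(n, rounds=20):
    """Miller-Rabin primality test (illustrative sketch).

    Returns False if n is certainly composite, True if n is prime or has
    passed every round (error chance at most 4**-rounds for composites).
    """
    if n < 2:
        return False
    # Trial division by a few small primes settles the easy cases.
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            # a is a witness: n is definitely composite.
            return False
    return True
```

With 20 rounds the bound is 4**-20, a bit under one in a trillion, and
each extra round divides it by four.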

François Pinard   http://pinard.progiciels-bpi.ca
