a program to delete duplicate files

Patrick Useldinger pu.news.001 at gmail.com
Sat Mar 12 07:01:39 EST 2005


John Machin wrote:

> Just look at the efficiency of processing N files of the same size S,
> where they differ after d bytes: [If they don't differ, d = S]
> 
> PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
> which is important for small N and large d].
> 
> Hashing method: O(NS) reading time, O(NS) hash calc time


Shouldn't you add the comparison time that is still needed after the
hash calculation? Hashes do not give a 100% guarantee. If a number of
files end up with identical hashes, you still need to read all of
them to make sure they really are duplicates.
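
To make that concrete, here is a rough sketch of the hash-first
approach in Python (only an illustration, not the code from my
program): every file is read once to compute its hash, and every
group of files sharing a hash then has to be read again to confirm
that they really are duplicates.

import hashlib
from collections import defaultdict

BLOCKSIZE = 64 * 1024

def file_hash(path):
    """Hash a file block by block (this reads the whole file once)."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(BLOCKSIZE), b''):
            h.update(block)
    return h.hexdigest()

def identical(path_a, path_b):
    """Byte-by-byte comparison, still needed when two hashes are equal."""
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        while True:
            a, b = fa.read(BLOCKSIZE), fb.read(BLOCKSIZE)
            if a != b:
                return False
            if not a:              # both files ended at the same point
                return True

def confirmed_duplicates(paths):
    """Group by hash first, then split each group into truly identical sets."""
    by_hash = defaultdict(list)
    for p in paths:
        by_hash[file_hash(p)].append(p)
    for group in by_hash.values():
        clusters = []
        for p in group:
            for c in clusters:
                if identical(c[0], p):   # second full read of both files
                    c.append(p)
                    break
            else:
                clusters.append([p])
        for c in clusters:
            if len(c) > 1:
                yield c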

Just to explain why I appear to be arguing like a lawyer: everybody I
spoke to about this program told me to use hashes, but nobody has been
able to explain why. I have found two possible reasons myself:

1) it's easier to program: you don't compare several files in
parallel, you just process them one by one. But that is not a complete
solution, and you still need to compare the file contents afterwards.
In the worst case you end up with three files that have identical
hashes, of which two are identical and one is not. To find out which,
you'd still have to program the comparison algorithm I use (see the
first sketch after this list), unless you say "oh well, there's a
problem with the hash, go and look yourself."

2) it's probably useful if you compare files over a network and want
to reduce bandwidth: a hash lets you do that at the cost of local CPU
and disk usage, which may be OK (see the second sketch below). That
was not my case.
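
Here is a rough sketch of what I mean by the direct approach (again
only an illustration, not my actual code): files of the same size are
read in parallel, block by block, and a file drops out of the
comparison as soon as it differs from all the others, so reading
stops after the first bytes that matter.

BLOCKSIZE = 64 * 1024

def identical_groups(paths):
    """Split files (already known to have the same size) into groups of
    identical content by reading them in parallel, block by block; a file
    is dropped as soon as it no longer matches anything else."""
    active = [[(p, open(p, 'rb')) for p in paths]]    # identical so far
    result = []
    while active:
        still_active = []
        for group in active:
            # sub-group the files by the content of their next block
            by_block = {}
            for path, fh in group:
                by_block.setdefault(fh.read(BLOCKSIZE), []).append((path, fh))
            for block, subgroup in by_block.items():
                if len(subgroup) < 2:
                    subgroup[0][1].close()            # unique: stop reading it
                elif block == b'':
                    # all reached end of file together: confirmed duplicates
                    for path, fh in subgroup:
                        fh.close()
                    result.append([path for path, _ in subgroup])
                else:
                    still_active.append(subgroup)     # keep comparing
        active = still_active
    return result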
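
And a minimal sketch of the network case, assuming the far end has
already sent the digest of its copy as a string (how it gets there --
ssh, HTTP, whatever -- doesn't matter here): only the short digest
crosses the wire, at the cost of hashing the local file.

import hashlib

def local_digest(path, blocksize=64 * 1024):
    """SHA-1 of a local file; only this short hex string needs to travel."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(blocksize), b''):
            h.update(block)
    return h.hexdigest()

# 'digest_from_remote' stands for whatever string the other machine sent
# back; if the digests differ, the files certainly differ, and only then
# does the full content need to cross the wire.
def probably_identical(path, digest_from_remote):
    return local_digest(path) == digest_from_remote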

-pu


