a program to delete duplicate files
pu.news.001 at gmail.com
Sat Mar 12 13:01:39 CET 2005
John Machin wrote:
> Just look at the efficiency of processing N files of the same size S,
> where they differ after d bytes: [If they don't differ, d = S]
> PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d
> which is important for small N and large d].
> Hashing method: O(NS) reading time, O(NS) hash calc time
Shouldn't you add the additional comparison time that has to be done
after hash calculation? Hashes do not give 100% guarantee. If there's a
large number of identical hashes, you'd still need to read all of these
files to make sure.
Just to explain why I appear to be a lawer: everybody I spoke to about
this program told me to use hashes, but nobody has been able to explain
why. I found myself 2 possible reasons:
1) it's easier to program: you don't compare several files in parallel,
but process one by one. But it's not perfect and you still need to
compare afterwards. In the worst case, you end up with 3 files with
identical hashes, of which 2 are identical and 1 is not. In order to
find this, you'd still have to program the algorithm I use, unless you
say "oh well, there's a problem with the hash, go and look yourself."
2) it's probably useful if you compare files over a network and you want
to reduce bandwidth. A hash lets you do that at the cost of local CPU
and disk usage, which may be OK. That was not my case.
More information about the Python-list