a program to delete duplicate files
eppstein at ics.uci.edu
Mon Mar 14 19:43:23 CET 2005
In article <1110617519.413704.237830 at g14g2000cwa.googlegroups.com>,
"John Machin" <sjmachin at lexicon.net> wrote:
> Just look at the efficiency of processing N files of the same size S,
> where they differ after d bytes: [If they don't differ, d = S]
I think this misses the point. It's easy to find the files that are
different: a file-size comparison will catch most of them, and most of
the rest can be detected within the first few kbytes. So I think it's
safe to assume both of these filtering mechanisms would be incorporated
into any good duplicate detector, and that any remaining tests would be
performed between files that are very likely to be the same as each
other.
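As a rough sketch of those two filters (the 4 kbyte prefix length here is
just a plausible cutoff, not a tuned value):

```python
import os
from collections import defaultdict

def group_candidates(paths, prefix_len=4096):
    """Group files that could be duplicates: first by size,
    then by the first few kbytes of content."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size can't have a duplicate
        by_prefix = defaultdict(list)
        for p in same_size:
            with open(p, 'rb') as f:
                by_prefix[f.read(prefix_len)].append(p)
        groups.extend(g for g in by_prefix.values() if len(g) > 1)
    return groups
```

Each returned group is a set of files that agree in size and in their
first prefix_len bytes, i.e. the likely-duplicates that still need
full verification.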
The hard part is verifying that the files that look like duplicates
really are duplicates. Doing so, for a group of m files that appear to
be the same, requires 2(m-1) whole-file reads if you use a
comparison-based method (each of the m-1 comparisons reads two files in
full), or m reads if you use a strong hashing method. You can't hope to
cut the reads off early when using comparisons, because the files won't
be different.
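The hashing approach, for example, reads each file exactly once
(SHA-256 here is just one choice of strong hash):

```python
import hashlib
from collections import defaultdict

def confirm_by_hash(group, chunk=1 << 16):
    """Verify a group of likely-duplicates with one full read per
    file (m reads total): files sharing a digest are duplicates,
    up to the negligible chance of a hash collision."""
    by_digest = defaultdict(list)
    for path in group:
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            while True:
                block = f.read(chunk)
                if not block:
                    break
                h.update(block)
        by_digest[h.hexdigest()].append(path)
    return [g for g in by_digest.values() if len(g) > 1]
```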
The question to me is: when you have a group of m>2 likely-to-be-equal
files, is 2(m-1) reads and very little computation better than m reads
and a strong hash computation? I haven't yet seen an answer to this,
but it's a question for experimentation rather than theory.
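For such an experiment, the comparison-based side might look like the
following sketch, which gives the 2(m-1) read count by comparing every
candidate against the first (it assumes the likely case where the whole
group really is equal; unequal files would need further pairwise
splitting):

```python
def files_equal(p1, p2, chunk=1 << 16):
    """Chunked byte-by-byte comparison: two full reads when equal."""
    with open(p1, 'rb') as f1, open(p2, 'rb') as f2:
        while True:
            b1, b2 = f1.read(chunk), f2.read(chunk)
            if b1 != b2:
                return False
            if not b1:  # both files exhausted simultaneously
                return True

def confirm_by_comparison(group):
    """2(m-1) whole-file reads: each of the m-1 comparisons
    re-reads the reference file plus one candidate."""
    ref = group[0]
    dupes = [ref] + [p for p in group[1:] if files_equal(ref, p)]
    return dupes if len(dupes) > 1 else []
```

Timing this against the hashing version on real candidate groups would
answer the question; which wins presumably depends on how much the hash
computation costs relative to the extra reads.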
Computer Science Dept., Univ. of California, Irvine