[perl-python] a program to delete duplicate files
eppstein at ics.uci.edu
Fri Mar 11 20:07:02 CET 2005
In article <4231c703$1 at news.vo.lu>,
Patrick Useldinger <pu.news.001 at gmail.com> wrote:
> > You need do no comparisons between files. Just use a sufficiently
> > strong hash algorithm (SHA-256 maybe?) and compare the hashes.
> That's not very efficient. IMO, it only makes sense in network-based
> operations such as rsync.
Well, but the spec didn't say efficiency was the primary criterion, it
said minimizing the number of comparisons was.
More seriously, the best I can think of that doesn't use a strong, slow
hash would be to group files by (file size, cheap hash), then compare
each file in a group with a representative of each distinct file found
among earlier files in the same group -- that leads to an average of
about three reads per duplicated file copy: one to hash it, and two for
the comparison between it and its representative (almost all of the
comparisons will turn out equal, but you still need to check unless you
use a strong hash).
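A rough sketch of that scheme in Python -- the cheap hash here is just an
MD5 of the first 4 KB, which is my assumption, not part of the proposal;
any cheap digest would do:

```python
import hashlib
import os
from collections import defaultdict

def cheap_hash(path, probe=4096):
    # Assumed cheap hash: digest of just the first 4 KB (one small read).
    with open(path, 'rb') as f:
        return hashlib.md5(f.read(probe)).digest()

def same_contents(a, b, block=1 << 16):
    # Byte-for-byte comparison: one full read of each file.
    with open(a, 'rb') as fa, open(b, 'rb') as fb:
        while True:
            ba, bb = fa.read(block), fb.read(block)
            if ba != bb:
                return False
            if not ba:          # both files exhausted simultaneously
                return True

def find_duplicates(paths):
    # Group by (size, cheap hash); only files in the same group can match.
    groups = defaultdict(list)
    for p in paths:
        groups[(os.path.getsize(p), cheap_hash(p))].append(p)
    dups = []
    for members in groups.values():
        reps = []  # one representative per distinct content seen so far
        for p in members:
            for r in reps:
                if same_contents(r, p):
                    dups.append((r, p))
                    break
            else:
                reps.append(p)  # new distinct content in this group
    return dups
```

Each duplicated copy costs roughly three reads: one for the cheap hash and
two (itself plus its representative) for the confirming comparison.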
I'm assuming of course that there are too many files and/or they're too
large just to keep them all in core.
Anyone have any data on whether reading files and SHA-256'ing them (or
whatever other cryptographic hash you think is strong enough) is
I/O-bound or CPU-bound? That is, is three reads with very little CPU
overhead really cheaper than one read with a strong hash?
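One way to get data on that is just to time the two operations on the same
file -- a sketch only, and note that the numbers will depend heavily on
whether the file is already in the OS cache:

```python
import hashlib
import time

def read_only(path, block=1 << 20):
    # Pure I/O: read the file and throw the bytes away.
    with open(path, 'rb') as f:
        while f.read(block):
            pass

def hash_file(path, block=1 << 20):
    # I/O plus CPU: read the file and SHA-256 it as we go.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(block)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

def compare(path, repeats=3):
    # Average wall-clock time per pass for each strategy.
    t0 = time.perf_counter()
    for _ in range(repeats):
        read_only(path)
    t_read = (time.perf_counter() - t0) / repeats
    t0 = time.perf_counter()
    for _ in range(repeats):
        hash_file(path)
    t_hash = (time.perf_counter() - t0) / repeats
    return t_read, t_hash
```

If t_hash is close to t_read, the hashing is I/O-bound and the one-read
strong-hash approach wins; if t_hash is several times larger, it's
CPU-bound and the three cheap reads may be cheaper.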
I guess it also depends on the number of files you expect to have
duplicates of. If most of the files exist in only one copy, it's clear
that the cheap hash will find them more cheaply than the expensive hash.
In that case you could combine the (file size, cheap hash) filtering
with the expensive hash and get only two reads per copy rather than three.
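That combined two-pass scheme might look like this (again assuming a
first-block MD5 as the cheap hash, which is my choice for illustration):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates_two_pass(paths, probe=4096):
    # Pass 1: bucket by (size, cheap hash of first block) -- one small read.
    buckets = defaultdict(list)
    for p in paths:
        with open(p, 'rb') as f:
            head = f.read(probe)
        buckets[(os.path.getsize(p), hashlib.md5(head).digest())].append(p)

    # Pass 2: strong-hash only files whose bucket has more than one member
    # -- one full read per candidate, no pairwise comparisons at all.
    by_digest = defaultdict(list)
    for members in buckets.values():
        if len(members) < 2:
            continue
        for p in members:
            h = hashlib.sha256()
            with open(p, 'rb') as f:
                for chunk in iter(lambda: f.read(1 << 16), b''):
                    h.update(chunk)
            by_digest[h.digest()].append(p)
    return [group for group in by_digest.values() if len(group) > 1]
```

Unique files cost one cheap read; each duplicated copy costs two reads
(the cheap probe plus the full strong hash) rather than three.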
Computer Science Dept., Univ. of California, Irvine