[perl-python] a program to delete duplicate files

David Eppstein eppstein at ics.uci.edu
Fri Mar 11 14:07:02 EST 2005


In article <4231c703$1 at news.vo.lu>,
 Patrick Useldinger <pu.news.001 at gmail.com> wrote:

> > You need do no comparisons between files.  Just use a sufficiently 
> > strong hash algorithm (SHA-256 maybe?) and compare the hashes.
> 
> That's not very efficient. IMO, it only makes sense in network-based 
> operations such as rsync.

Well, but the spec didn't say efficiency was the primary criterion; it 
said minimizing the number of comparisons was.

More seriously, the best I can think of that doesn't use a strong, slow 
hash would be to group files by (file size, cheap hash), then compare 
each file in a group with a representative of each distinct file found 
among earlier files in the same group -- that leads to an average of 
about three reads per duplicated file copy: one to hash it, and two for 
the comparison between it and its representative (almost all of the 
comparisons will turn out equal but you still need to check unless you 
use a strong hash).
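
Something along these lines, in Python (an untested sketch: CRC32 as 
the cheap hash and a plain byte-by-byte comparison, both of which are 
just one choice among many):

import os
import zlib
from collections import defaultdict

CHUNK = 65536

def cheap_hash(path):
    """Cheap hash of the whole file: CRC32 (one full read, little CPU)."""
    crc = 0
    with open(path, 'rb') as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                return crc
            crc = zlib.crc32(block, crc)

def same_contents(path1, path2):
    """Byte-by-byte comparison of two files (one read of each)."""
    with open(path1, 'rb') as f1, open(path2, 'rb') as f2:
        while True:
            b1 = f1.read(CHUNK)
            b2 = f2.read(CHUNK)
            if b1 != b2:
                return False
            if not b1:
                return True

def find_duplicates(paths):
    """Group by (size, cheap hash), then compare each file against one
    representative of each distinct content already seen in its group."""
    groups = defaultdict(list)
    for p in paths:
        groups[(os.path.getsize(p), cheap_hash(p))].append(p)

    duplicates = []
    for group in groups.values():
        representatives = []    # one path per distinct content seen so far
        for p in group:
            for r in representatives:
                if same_contents(p, r):
                    duplicates.append((r, p))
                    break
            else:
                representatives.append(p)
    return duplicates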

I'm assuming of course that there are too many files and/or they're too 
large just to keep them all in core.

Anyone have any data on whether reading files and SHA-256'ing them (or 
whatever other cryptographic hash you think is strong enough) is 
I/O-bound or CPU-bound?  That is, is three reads with very little CPU 
overhead really cheaper than one read with a strong hash?
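
A crude way to measure that, if anyone wants to try (also an untested 
sketch; watch out for the OS cache making whichever pass runs second 
look unrealistically fast):

import hashlib
import time

def time_pass(paths, strong_hash):
    """Read every file in 'paths' once; optionally SHA-256 it as we go."""
    start = time.perf_counter()
    for p in paths:
        h = hashlib.sha256() if strong_hash else None
        with open(p, 'rb') as f:
            while True:
                block = f.read(65536)
                if not block:
                    break
                if h is not None:
                    h.update(block)
    return time.perf_counter() - start

# 'paths' is whatever list of test files you have lying around:
# print(time_pass(paths, False))   # pure I/O
# print(time_pass(paths, True))    # I/O plus SHA-256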

I guess it also depends on the number of files you expect to have 
duplicates of.  If most of the files exist in only one copy, it's clear 
that the cheap hash will find them more cheaply than the expensive hash.  
In that case you could combine the (file size, cheap hash) filtering 
with the expensive hash and get only two reads per copy rather than 
three.
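
In code, the only change from the earlier sketch is that the 
byte-by-byte comparison gets replaced by a strong digest (again 
untested, reusing cheap_hash and CHUNK from above, and simply trusting 
SHA-256 not to collide):

import os
import hashlib
from collections import defaultdict

def find_duplicates_two_reads(paths):
    """Group by (size, cheap hash); within any group of two or more,
    trust a strong hash instead of a byte-by-byte comparison.
    Reuses cheap_hash() and CHUNK from the sketch above."""
    groups = defaultdict(list)
    for p in paths:
        groups[(os.path.getsize(p), cheap_hash(p))].append(p)

    duplicates = []
    for group in groups.values():
        if len(group) < 2:
            continue          # unique by (size, cheap hash): one read total
        seen = {}             # strong digest -> first path with that digest
        for p in group:
            h = hashlib.sha256()
            with open(p, 'rb') as f:
                while True:
                    block = f.read(CHUNK)
                    if not block:
                        break
                    h.update(block)
            d = h.digest()
            if d in seen:
                duplicates.append((seen[d], p))   # second (and last) read
            else:
                seen[d] = p
    return duplicates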

-- 
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/


