[perl-python] a program to delete duplicate files

Patrick Useldinger pu.news.001 at gmail.com
Fri Mar 11 16:03:22 EST 2005


David Eppstein wrote:

> Well, but the spec didn't say efficiency was the primary criterion, it 
> said minimizing the number of comparisons was.

That's exactly what my program does.

> More seriously, the best I can think of that doesn't use a strong slow 
> hash would be to group files by (file size, cheap hash) then compare 
> each file in a group with a representative of each distinct file found 
> among earlier files in the same group -- that leads to an average of 
> about three reads per duplicated file copy: one to hash it, and two for 
> the comparison between it and its representative (almost all of the 
> comparisons will turn out equal but you still need to check unless you 
> use a strong hash).

My point is: forget hashes. If you work with hashes, you have to read 
each file completely, cheap hash or not. My program reads *at most* 100% 
of each file it analyses, and usually much less. Also, I do plain 
comparisons, which are much cheaper than hash calculations.
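A stripped-down sketch of the idea (not fdups itself -- the chunk size and 
names below are just for illustration): files of equal size are read in 
parallel, chunk by chunk, and re-partitioned whenever their chunks differ, 
so a file that turns out to be unique is dropped after its first differing 
block and no file is ever read more than once.

import os
from collections import defaultdict

CHUNK = 64 * 1024   # assumed per-file read size, just for illustration

def duplicates_by_content(paths):
    """Return groups of paths whose contents are byte-for-byte identical."""
    # Only files sharing a size can possibly be duplicates.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    results = []
    for group in by_size.values():
        if len(group) < 2:
            continue                           # unique size: cannot be a duplicate
        # One open handle per same-size file; compare them chunk by chunk.
        candidates = [[(p, open(p, 'rb')) for p in group]]
        while candidates:
            survivors = []
            for cand in candidates:
                # Split the candidate group according to its next chunk.
                by_chunk = defaultdict(list)
                for path, fh in cand:
                    by_chunk[fh.read(CHUNK)].append((path, fh))
                for chunk, sub in by_chunk.items():
                    if len(sub) < 2:
                        sub[0][1].close()      # became unique: stop reading it
                    elif chunk == b'':
                        # All reached EOF together: genuine duplicates.
                        results.append([p for p, _ in sub])
                        for _, fh in sub:
                            fh.close()
                    else:
                        survivors.append(sub)  # identical so far: keep reading
            candidates = survivors
    return results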

> I'm assuming of course that there are too many files and/or they're too 
> large just to keep them all in core.

I assume that there are enough file handles available to keep one open 
per file of a given size. This led to trouble on Windows installations, 
but I guess that's a parameter that can be changed. On Linux, I never had 
that problem.

Regarding buffer size, I use a maximum which is then split up between 
all open files.
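In sketch form (the cap below is an assumed value, not the one fdups 
actually uses):

MAX_TOTAL_BUFFER = 8 * 1024 * 1024   # assumed overall cap (8 MB)

def per_file_chunk(open_file_count, floor=4096):
    # Per-file read size chosen so that all concurrently open files
    # together stay within MAX_TOTAL_BUFFER.
    return max(floor, MAX_TOTAL_BUFFER // open_file_count)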

> Anyone have any data on whether reading files and SHA-256'ing them (or 
> whatever other cryptographic hash you think is strong enough) is 
> I/O-bound or CPU-bound?  That is, is three reads with very little CPU 
> overhead really cheaper than one read with a strong hash?

It also depends on the OS. I found that my program runs much slower on 
Windows, probably because Linux anticipates reads and tries to reduce 
head movement, which Windows apparently does not.

> I guess it also depends on the number of files you expect to have 
> duplicates of.  If most of the files exist in only one copy, it's clear 
> that the cheap hash will find them more cheaply than the expensive hash.  
> In that case you could combine the (file size, cheap hash) filtering 
> with the expensive hash and get only two reads per copy rather than 
> three.
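
For concreteness, the pipeline you describe might look roughly like this 
(a sketch with assumed names, hash choices and block sizes, not anyone's 
actual code): group by size, then by a cheap hash of the first block, then 
by a strong hash over the whole file -- i.e. one partial and one full read 
per remaining candidate.

import hashlib
import os
import zlib
from collections import defaultdict

def duplicates_by_hash(paths, first_block=4096):
    """Group by size, then by a cheap hash of the first block,
    then by a strong hash over the full contents."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    by_cheap = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue
        for p in group:
            with open(p, 'rb') as fh:
                cheap = zlib.crc32(fh.read(first_block))   # first (partial) read
            by_cheap[(size, cheap)].append(p)

    by_strong = defaultdict(list)
    for group in by_cheap.values():
        if len(group) < 2:
            continue
        for p in group:
            digest = hashlib.sha256()
            with open(p, 'rb') as fh:
                for block in iter(lambda: fh.read(1 << 20), b''):
                    digest.update(block)                   # second (full) read
            by_strong[digest.hexdigest()].append(p)

    return [g for g in by_strong.values() if len(g) > 1]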

Sorry, but I still cannot see the point of using hashes. Maybe you'll 
have a look at my program and tell me where a hash could be useful?

It's available at http://www.homepages.lu/pu/fdups.html.

Regards,
-pu


