[perl-python] a program to delete duplicate files
pu.news.001 at gmail.com
Fri Mar 11 22:03:22 CET 2005
David Eppstein wrote:
> Well, but the spec didn't say efficiency was the primary criterion, it
> said minimizing the number of comparisons was.
That's exactly what my program does.
> More seriously, the best I can think of that doesn't use a strong slow
> hash would be to group files by (file size, cheap hash) then compare
> each file in a group with a representative of each distinct file found
> among earlier files in the same group -- that leads to an average of
> about three reads per duplicated file copy: one to hash it, and two for
> the comparison between it and its representative (almost all of the
> comparisons will turn out equal but you still need to check unless you
My point is: forget hashes. If you work with hashes, you have to read
each file completely, cheap hash or not. My program reads *at most*
100% of the data in the files it analyses, and usually much less. Also,
I do plain comparisons, which are much cheaper than hash calculations.
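To make that concrete, here is a minimal sketch of the idea (not the
actual fdups code; the traversal, the buffer budget, and all names are
mine): group files by size, open each group, and repeatedly split the
groups on the next block of data, so reading stops as soon as a file
diverges from every other file of the same size:

    import os
    import sys
    from collections import defaultdict

    MAX_BUFFER = 1 << 20          # total read budget, split between open files

    def find_dups(root):
        """Group files by size, then split the groups block by block."""
        by_size = defaultdict(list)
        for dirpath, _dirs, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)
        for paths in by_size.values():
            if len(paths) < 2:
                continue                          # unique size: no read at all
            block = max(1, MAX_BUFFER // len(paths))
            groups = [[(p, open(p, 'rb')) for p in paths]]
            while groups:
                parts = defaultdict(list)
                for path, fh in groups.pop():     # one block from every file
                    parts[fh.read(block)].append((path, fh))
                for data, part in parts.items():
                    if len(part) < 2:             # diverged: not a duplicate
                        part[0][1].close()
                    elif data == b'':             # EOF reached together: dups
                        for _p, fh in part:
                            fh.close()
                        yield [p for p, _fh in part]
                    else:
                        groups.append(part)       # identical so far: read on

    if __name__ == '__main__':
        for dup_set in find_dups(sys.argv[1]):
            print(dup_set)

A file is only read as far as it stays identical to another file of the
same size, which is where the "usually much less than 100%" comes from.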
> I'm assuming of course that there are too many files and/or they're too
> large just to keep them all in core.
I assume that enough file handles are available to keep one open per
file of a given size. This led to trouble on Windows installations, but
I guess that's a parameter one can change. On Linux, I never hit the
problem. Regarding buffer size, I use a maximum which is then split up
between all open files.
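On Linux the handle headroom can be checked up front; something like
this sketch (Unix-only, and the headroom constant is a guess) is what I
mean by treating it as a parameter:

    import resource  # Unix-only, part of why Windows is touchier here

    def max_open_group():
        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        return soft - 16                  # headroom for stdio, directories, etc.

    def block_size(max_buffer, nfiles):
        return max(1, max_buffer // nfiles)   # split the budget evenly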
> Anyone have any data on whether reading files and SHA-256'ing them (or
> whatever other cryptographic hash you think is strong enough) is
> I/O-bound or CPU-bound? That is, is three reads with very little CPU
> overhead really cheaper than one read with a strong hash?
It also depends on the OS. I found that my program runs much slower on
Windows, probably because Linux anticipates reads (read-ahead) and
schedules them to reduce head movement.
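The read-versus-hash question is easy to test empirically; a rough
timing harness like this sketch (run it on a file larger than RAM, or
drop the page cache between runs, so the second pass isn't served from
cache) separates the read cost from the hash cost:

    import hashlib
    import sys
    import time

    def one_pass(path, hash_data=False, block=1 << 16):
        """Time one sequential pass, optionally SHA-256'ing each block."""
        h = hashlib.sha256() if hash_data else None
        start = time.perf_counter()
        with open(path, 'rb') as fh:
            for chunk in iter(lambda: fh.read(block), b''):
                if h is not None:
                    h.update(chunk)
        return time.perf_counter() - start

    if __name__ == '__main__':
        print('read only  :', one_pass(sys.argv[1]))
        print('read+sha256:', one_pass(sys.argv[1], hash_data=True))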
> I guess it also depends on the number of files you expect to have
> duplicates of. If most of the files exist in only one copy, it's clear
> that the cheap hash will find them more cheaply than the expensive hash.
> In that case you could combine the (file size, cheap hash) filtering
> with the expensive hash and get only two reads per copy rather than
> three.
Sorry, but I still cannot see a point in using hashes. Maybe you could
have a look at my program and tell me where a hash would be useful?
It's available at http://www.homepages.lu/pu/fdups.html.
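For reference, my understanding of the pipeline you describe, with
CRC32 of the first block as the cheap hash and SHA-256 as the strong
one (both choices are my guesses at what you have in mind), is roughly:

    import hashlib
    import os
    import zlib
    from collections import defaultdict

    def cheap_key(path, block=4096):
        with open(path, 'rb') as fh:          # one short read per file
            return os.path.getsize(path), zlib.crc32(fh.read(block))

    def strong_digest(path, block=1 << 16):
        h = hashlib.sha256()                  # one full read per candidate
        with open(path, 'rb') as fh:
            for chunk in iter(lambda: fh.read(block), b''):
                h.update(chunk)
        return h.hexdigest()

    def hash_dups(paths):
        groups = defaultdict(list)
        for p in paths:
            groups[cheap_key(p)].append(p)
        for candidates in groups.values():
            if len(candidates) < 2:
                continue
            by_digest = defaultdict(list)
            for p in candidates:
                by_digest[strong_digest(p)].append(p)
            for dup_set in by_digest.values():
                if len(dup_set) > 1:
                    yield dup_set

Every file that survives the cheap filter gets read in full here, which
is exactly the cost my program avoids.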