a program to delete duplicate files

David Eppstein eppstein at ics.uci.edu
Tue Mar 15 01:14:16 EST 2005


In article <871xaisdqz.fsf at pobox.com>, jjl at pobox.com (John J. Lee) 
wrote:

> > If you read them in parallel, it's _at most_ m (m is the worst case
> > here), not 2(m-1). In my tests, it has always been significantly less
> > than m.
> 
> Hmm, Patrick's right, David, isn't he?

Yes, I was only considering pairwise comparisons. As he says, 
comparing all the files in a group simultaneously would avoid repeated 
reads without the CPU overhead of a strong hash -- assuming you're on a 
system that allows you to have enough files open at once...
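A minimal sketch of that idea (my own illustration, not anyone's posted 
code): group candidate files by size, then read every file in a group 
in parallel, one chunk at a time, splitting the group whenever chunks 
differ. Each file is read at most once and no hash is computed. The 
function name and chunk size are arbitrary; as noted above, it assumes 
the OS lets you keep a handle open for every file in the largest group.

```python
import os
from collections import defaultdict

def duplicate_groups(paths, chunk_size=65536):
    """Partition paths into groups of byte-identical files."""
    # Files of different sizes can't be duplicates, so bucket by size first.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    results = []
    for group in by_size.values():
        if len(group) < 2:
            continue
        # One open handle per file in the group -- this is where the
        # "enough files open at once" assumption comes in.
        handles = {p: open(p, 'rb') for p in group}
        try:
            stack = [group]
            while stack:
                g = stack.pop()
                # Read the next chunk of each file and split the group
                # by chunk contents.  All files in g are at the same
                # offset, and (same size) reach EOF on the same chunk.
                by_chunk = defaultdict(list)
                for p in g:
                    by_chunk[handles[p].read(chunk_size)].append(p)
                for chunk, sub in by_chunk.items():
                    if len(sub) < 2:
                        continue             # unique prefix: not a duplicate
                    if chunk == b'':
                        results.append(sub)  # EOF with others still matching
                    else:
                        stack.append(sub)    # still matching: keep reading
        finally:
            for f in handles.values():
                f.close()
    return results
```

In the worst case (all files identical) every file is read exactly once 
in full, matching the "at most m" bound from the quoted discussion; as 
soon as a file's chunk diverges from the rest of its group, it stops 
being read.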

> And I'm not sure what the trade off between disk seeks and disk reads
> does to the problem, in practice (with caching and realistic memory
> constraints).

Another interesting point.

-- 
David Eppstein
Computer Science Dept., Univ. of California, Irvine
http://www.ics.uci.edu/~eppstein/
