A sets algorithm
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Mon Feb 8 23:13:07 EST 2016
On Tuesday 09 February 2016 02:11, Chris Angelico wrote:
> That's fine for comparing one file against one other. He started out
> by saying he already had a way to compare files for equality. What he
> wants is a way to capitalize on that to find all the identical files
> in a group. A naive approach would simply compare every file against
> every other, for O(N*N) comparisons - but a hash lookup can make that
> O(N) on the files themselves, plus (I think) an O(N log N) hash
> comparison job, which has much lower constant factors.
You're mixing up N's there :-) In the first two (compare every file against
every other file, and hash lookup), N stands for the number of files. But in
the hash comparison, well, I'm not sure what you mean by that, unless you
mean the calculation of the hash itself, which will be O(M) where M is the
size of the file.
Unfortunately, even using a hash, you still have to do O(N**2) work, or
rather, O(N*M) where N is the number of files and M is their size.
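For concreteness, the hash-everything baseline looks roughly like this (a sketch of my own using hashlib and plain path strings; the name group_by_digest is made up). Every file gets read in full to hash it, which is where the O(N*M) comes from:

import hashlib

def group_by_digest(paths, chunksize=2**20):
    groups = {}
    for path in paths:
        digest = hashlib.sha256()
        with open(path, 'rb') as f:
            # Reading the whole file is the O(M) part.
            for chunk in iter(lambda: f.read(chunksize), b''):
                digest.update(chunk)
        groups.setdefault(digest.hexdigest(), []).append(path)
    # Buckets with more than one path hold (probable) duplicates.
    return {d: ps for d, ps in groups.items() if len(ps) > 1}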
Assuming that duplicate files are relatively rare, the best way to do this
is to collate files with the same size. Looking up the file size ought to be
pretty fast. It is an IO operation, but it is approximately constant time in
the sense that it doesn't actually depend on the size of the file.
(If you're using a programming language or operating system where the only
way to find out how big a file is to open it and count the bytes, this
obviously won't work for you.)
So let's collect files into a separate bucket for each unique file size:
bysize = {}
for file in files:
    bysize.setdefault(file.getsize(), []).append(file)
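(If you only have plain path strings rather than objects with a getsize() method, the equivalent using os.path.getsize would be something like the following -- my variation, not part of the original recipe:)

import os

bysize = {}
for path in paths:
    try:
        size = os.path.getsize(path)  # one stat() call, independent of the file's size
    except OSError:
        continue  # skip files we can't stat
    bysize.setdefault(size, []).append(path)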
In practice, you may find that nearly all files have different sizes, and
most of the buckets have only a single file in them. Those files you can just
ignore, as they must be unique.
So instead of having to compare N files against each of the others, you are
left with a much smaller number of buckets, each with (hopefully) only two
or three files of the same size. You might have gone from N=1000000 to
thirty buckets with five files each.
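In code, whittling the buckets down to the interesting ones is a one-liner (again my sketch, following on from the bysize dict above):

candidates = [group for group in bysize.values() if len(group) > 1]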
And most of those files will probably differ on the very first byte, or at
least within the first 16K of bytes, so it's hardly worth hashing them
(especially if the hash algorithm is an expensive cryptographic hash, or if
the file is large).
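So a cheap pre-check along these lines (my sketch, with the 16K limit chosen arbitrarily) rules out most candidate pairs without hashing anything; the __eq__ method below builds the same idea in:

def same_prefix(file_a, file_b, limit=16 * 1024):
    # Compare only the first `limit` bytes of each file.
    with open(file_a, 'rb') as a, open(file_b, 'rb') as b:
        return a.read(limit) == b.read(limit)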
I would delay the hash comparison as long as possible, something like this:
def __hash__(self):
    if self._hash is None:
        self._hash = calculate_hash()  # expensive, so we cache it
    return self._hash

def __eq__(self, other):
    size = self.getsize()
    if size != other.getsize():
        return False
    if self._hash and other._hash:
        # Both hashes are already done, so might as well use them.
        return (
            hash(self) == hash(other)
            # in case of collisions...
            and self.read() == other.read()
        )
    if size < MAXSIZE:
        return self.read(size) == other.read(size)
    else:
        if self.read(MAXSIZE) != other.read(MAXSIZE):
            return False
        else:
            return (
                hash(self) == hash(other)
                # in case of collisions...
                and self.read() == other.read()
            )
If your hash is strong enough that collisions are vanishingly rare, you
could ignore the `and self.read() == other.read()` check.
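Assuming those two methods live on some small File wrapper class (not shown above), putting it all together on one size bucket might look roughly like this -- again just a sketch:

def find_duplicates(bucket):
    # Buckets are tiny, so pairwise __eq__ comparisons are cheap.
    groups = []
    for f in bucket:
        for group in groups:
            if f == group[0]:   # delegates to __eq__ above
                group.append(f)
                break
        else:
            groups.append([f])
    return [group for group in groups if len(group) > 1]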
--
Steve