Why checksum? [was Re: Fuzzy Lookups]
Tom Anderson
twic at urchin.earth.li
Wed Feb 1 05:37:57 EST 2006
On Tue, 31 Jan 2006, it was written:
> Steven D'Aprano <steve at REMOVETHIScyber.com.au> writes:
>
>> This isn't a criticism, it is a genuine question. Why do people compare
>> local files with MD5 instead of doing a byte-to-byte compare?
I often wonder that!
>> Is it purely a caching thing (once you have the checksum, you don't
>> need to read the file again)? Are there any other reasons?
>
> It's not just a matter of comparing two files. The idea is you have
> 10,000 local files and you want to find which ones are duplicates (i.e.
> if files 637 and 2945 have the same contents, you want to discover
> that). The obvious way is make a list of hashes, and sort the list.
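For concreteness, that approach comes out something like this (a rough
sketch only - i've grouped with a dict rather than sorting, but it's the
same idea, and the function names and chunk size are mine):

import hashlib

def file_md5(path, chunk_size=65536):
    # hash the file's entire contents - note that this reads
    # every last byte of every file
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def duplicates_by_hash(paths):
    # group paths by content hash; any group with more than one
    # member is a set of duplicates
    groups = {}
    for path in paths:
        groups.setdefault(file_md5(path), []).append(path)
    return [g for g in groups.values() if len(g) > 1]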
Obvious, perhaps; prudent, no. To make the list of hashes, you have to
read all of every single file first, which could take a while. If your
files are reasonably random at the beginning, you'd be better off just
using the first N bytes of the file, since this would be just as
effective, and cheaper to read. Looking at some random MP3s i have to
hand, i find they all differ within the first 20 bytes (probably due to
the ID3 tags), so this should work for these.
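As a sketch (again the names are mine, and n=20 is just a guess based on
those MP3s):

def duplicates_by_prefix(paths, n=20):
    # group files by their first n bytes; cheap, and for files
    # that are reasonably random at the beginning, nearly as
    # discriminating as a full hash
    groups = {}
    for path in paths:
        with open(path, 'rb') as f:
            prefix = f.read(n)
        groups.setdefault(prefix, []).append(path)
    # groups with more than one member are only *candidate*
    # duplicates - they still need a full comparison to confirm
    return [g for g in groups.values() if len(g) > 1]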
Better yet, note that if two files are identical, they must have the same
length, and that finding the length can be done very cheaply, so an even
quicker approach is to make a list of lengths, sort that, and look for
duplicates; when you find one, do a byte-by-byte comparison of the files
(probably terminating in the first block) to see if they really are the
same.
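In code, something like this (again only a sketch; a block_size of 4096
is a guess at a typical disk block):

import os
from itertools import combinations

def same_contents(a, b, block_size=4096):
    # byte-by-byte comparison, one block at a time; if the files
    # differ early on, this bails out after the first block
    with open(a, 'rb') as fa, open(b, 'rb') as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files exhausted: identical
                return True

def duplicates_by_length(paths):
    # group by size first - no file reads at all, just stat calls -
    # then confirm candidate pairs byte-by-byte
    by_size = {}
    for path in paths:
        by_size.setdefault(os.path.getsize(path), []).append(path)
    return [(a, b)
            for group in by_size.values()
            for a, b in combinations(group, 2)
            if same_contents(a, b)]

(Incidentally, the standard library's filecmp.cmp(a, b, shallow=False)
does more or less what same_contents does here.)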
By way of example, of the 2690 music files in my iTunes library, i have
twelve pairs of same-sized files [1], and all of these differ within the
first 18 bytes (mostly, within the first 9 bytes). Therefore, i could rule
out duplication with just 22 data blocks read from disk (plus rather more
blocks of directory information and inodes, of course). A hash-based
approach would have had to wade through a touch over 13 GB of data before
it could even get started.
Of course, there are situations where this is the wrong approach - if you
have a collection of serialised sparse matrices, for example, which
consist of identically-sized blocks of zeroes with a scattering of ones
throughout, then lengths and prefixes will be useless, whereas hashes will
work perfectly. However, here, we're looking at MP3s, where lengths and
prefixes will be a win.
tom
[1] The distribution of those is a bit weird: ten pairs consist of two
tracks from The Conet Project's 'Recordings of Shortwave Numbers
Stations', one pairs a track from that with The Doors' 'Horse Latitudes',
and one is between two Calexico songs ('The Ride (Pt II)' and 'Minas De
Cobre'). Why on earth are eleven of the twelve pairs made up of songs by
the same artist? Is it really that they're pairs of songs from the same
compressor (those tracks aren't from CD), i wonder?
--
Not all legislation can be eye-catching, and it is important that the
desire to achieve the headlines does not mean that small but useful
measures are crowded out of the legislative programme. -- Select Committee
on Transport