binary file compare...

Nigel Rantor wiggly at wiggly.org
Wed Apr 15 13:04:18 EDT 2009


Martin wrote:
> On Wed, Apr 15, 2009 at 11:03 AM, Steven D'Aprano
> <steven at remove.this.cybersource.com.au> wrote:
>> The checksum does look at every byte in each file. Checksumming isn't a
>> way to avoid looking at each byte of the two files, it is a way of
>> mapping all the bytes to a single number.
> 
> My understanding of the original question was a way to determine
> whether 2 files are equal or not. Creating a checksum of 1-n files and
> comparing those checksums IMHO is a valid way to do that. I know it's
> a (one way) mapping between a (possibly) longer byte sequence and
> another one, how does checksumming not take each byte in the original
> sequence into account?

The fact that two md5 hashes are equal does not prove that the sources
they were generated from are equal. To establish equality you must still
perform a byte-by-byte comparison, which is also much less work for the
processor than generating an md5 or sha hash.
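
To make the byte-by-byte comparison concrete, here is a minimal sketch.
The function name files_equal and the chunk_size parameter are my own
choices for illustration, not anything from the original question. It
assumes both paths point at regular files that are not being modified
while they are read.

```python
import os


def files_equal(path_a, path_b, chunk_size=64 * 1024):
    """Return True only if both files hold exactly the same bytes."""
    # Files of different sizes cannot be equal, so nothing needs reading.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            # Stop at the first differing chunk; the rest is never read.
            if chunk_a != chunk_b:
                return False
            # Both chunks are empty: end of both files, no difference found.
            if not chunk_a:
                return True
```

Unlike a hash, this can stop at the first difference it sees. The
standard library's filecmp.cmp(path_a, path_b, shallow=False) should
give the same answer if you would rather not write the loop yourself.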

If you insist on using a hashing algorithm to determine the equivalence
of two files you will come to realise that it is a flawed plan, because
sooner or later you will find two files with different contents that
nonetheless hash to the same value.

The more files you test with, the quicker you will find out this basic truth.

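This is not complex, it's a simple fact about how hashing algorithms work:
there are far more possible files than possible hash values, so some
distinct files must share a hash.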
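
The same effect can be seen at a small scale. The sketch below uses a
deliberately tiny hash, tiny_hash, which is a made-up helper that keeps
only the first byte of a SHA-256 digest; it is not a real algorithm
anyone should use. With only 256 possible values, any 257 distinct
inputs are guaranteed to contain a collision. Real md5 and sha hashes
have vastly more possible values, so collisions are far rarer, but the
counting argument is the same.

```python
import hashlib


def tiny_hash(data):
    """Hypothetical 8-bit hash: the first byte of the SHA-256 digest."""
    return hashlib.sha256(data).digest()[0]


def find_collision(count=257):
    """Return two different inputs with the same tiny_hash, or None."""
    seen = {}  # maps hash value -> first input that produced it
    for i in range(count):
        data = str(i).encode("ascii")
        value = tiny_hash(data)
        if value in seen:
            # Different contents, identical hash value.
            return seen[value], data
        seen[value] = data
    return None


first, second = find_collision()
print(first, second, tiny_hash(first) == tiny_hash(second))
```

Comparing the two inputs byte by byte shows at once that they differ,
even though their hashes agree.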

   n



