binary file compare...

Nigel Rantor wiggly at wiggly.org
Fri Apr 17 06:49:09 EDT 2009


Adam Olsen wrote:
> On Apr 16, 4:27 pm, "Rhodri James" <rho... at wildebst.demon.co.uk>
> wrote:
>> On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen <rha... at gmail.com> wrote:
>>> On Apr 16, 3:16 am, Nigel Rantor <wig... at wiggly.org> wrote:
>>>> Okay, before I tell you about the empirical, real-world evidence I have
>>>> could you please accept that hashes collide and that no matter how many
>>>> samples you use the probability of finding two files that do collide is
>>>> small but not zero.
>>> I'm afraid you will need to back up your claims with real files.
>> So that would be a "no" then.  If the implementation of dicts in Python,
>> say, were to assert as you are that the hashes aren't going to collide,
>> then I'd have to walk away from it.  There's no point in using something
>> that guarantees a non-zero chance of corrupting your data.
> 
> Python's hash is only 32 bits on a 32-bit box, so even 2**16 keys (or
> 65 thousand) will give you a decent chance of a collision.  In
> contrast MD5 needs 2**64, and a *good* hash needs 2**128 (SHA-256) or
> 2**256 (SHA-512).  The two are at totally different extremes.

I'm just going to go ahead and take the above as an admission by you 
that the chance of collision is non-zero, and that, once we accept that 
fact, you cannot rely on a hash function to tell you whether two files 
are identical.

Thanks.

> There is *always* a non-zero chance of corruption, due to software
> bugs, hardware defects, or even operator error.  It is only in that
> broader context that you can realize just how minuscule the risk is.

Please explain why you're dragging the notion of corruption into this; 
it seems to be beside the point.

> Can you explain to me why you justify great lengths of paranoia, when
> the risk is so much lower?

Because in the real world, where I work, in practical, real, factual 
terms I have seen it happen. Not once. Not twice. But many, many times.

>> Why are you advocating a solution to the OP's problem that is more
>> computationally expensive than a simple byte-by-byte comparison and
>> doesn't guarantee to give the correct answer?
> 
> For single, one-off comparison I have no problem with a byte-by-byte
> comparison.  There's a decent chance the files won't be in the OS's
> cache anyway, so disk IO will be your bottleneck.
>
> Only if you're doing multiple comparisons is a hash database
> justified.  Even then, if you expect matching files to be fairly rare
> I won't lose any sleep if you're paranoid and do a byte-by-byte
> comparison anyway.  New vulnerabilities are found, and if you don't
> update promptly there is a small (but significant) chance of a
> malicious file leading to collision.

If I have a number of files then I would certainly use a hash as a quick 
test, but if two files hash to the same value I still have to go and 
compare them. Hashing in this case saves me doing a full comparison 
against every other file, but it doesn't obviate the need for a 
byte-by-byte comparison to confirm that files which hash to the same 
value really are the same.
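
To be concrete, the scheme I'm describing is roughly the following. 
This is only a minimal sketch (the helper names and the chunk size are 
mine, purely for illustration), not production code:

import hashlib
import filecmp
from collections import defaultdict

def file_digest(path, chunk_size=1 << 20):
    # Return the MD5 digest of a file, read in chunks to bound memory use.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

def find_duplicates(paths):
    # Group candidate duplicates by digest, then confirm byte-by-byte.
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[file_digest(path)].append(path)

    duplicates = []
    for candidates in by_digest.values():
        first = candidates[0]
        for other in candidates[1:]:
            # A shared digest is only a hint; the byte comparison decides.
            if filecmp.cmp(first, other, shallow=False):
                duplicates.append((first, other))
    return duplicates

The hash pass means each file is read once up front, and the expensive 
pairwise comparison only happens for the handful of files that share a 
digest.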

> That's not my concern though.  What I'm responding to is Nigel
> Rantor's grossly incorrect statements about probability.  The chance
> of collision, in our life time, is *insignificant*.

Which statements would those be? That the chance of two files hashing to 
the same value is non-zero? You admit as much above.

Also, please remember I gave you a way of verifying what I said: go 
crawl the web for images, pages, whatever, build a hash DB and tell me 
how long it takes you to find a collision using MD5 (which is the hash I 
was talking about when I told you I had real-world experience to back up 
the theoretical claim that collisions occur).
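
If it helps, the experiment needs nothing more elaborate than something 
like the sketch below (the directory name is a placeholder; point it at 
whatever you crawled, and filecmp is only there to make sure the two 
files really do differ before calling it a collision):

import filecmp
import hashlib
import os

seen = {}  # digest -> first path seen with that digest
for root, _dirs, names in os.walk('crawl_output'):  # placeholder directory
    for name in names:
        path = os.path.join(root, name)
        with open(path, 'rb') as f:
            digest = hashlib.md5(f.read()).hexdigest()
        if digest in seen and not filecmp.cmp(seen[digest], path, shallow=False):
            # Two distinct files with the same MD5 digest: a genuine collision.
            print('collision:', seen[digest], path)
        seen.setdefault(digest, path)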

Regards,

   Nigel



