binary file compare...
wiggly at wiggly.org
Fri Apr 17 12:49:09 CEST 2009
Adam Olsen wrote:
> On Apr 16, 4:27 pm, "Rhodri James" <rho... at wildebst.demon.co.uk> wrote:
>> On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen <rha... at gmail.com> wrote:
>>> On Apr 16, 3:16 am, Nigel Rantor <wig... at wiggly.org> wrote:
>>>> Okay, before I tell you about the empirical, real-world evidence I have
>>>> could you please accept that hashes collide and that no matter how many
>>>> samples you use the probability of finding two files that do collide is
>>>> small but not zero.
>>> I'm afraid you will need to back up your claims with real files.
>> So that would be a "no" then. If the implementation of dicts in Python,
>> say, were to assert as you are that the hashes aren't going to collide,
>> then I'd have to walk away from it. There's no point in using something
>> that guarantees a non-zero chance of corrupting your data.
> Python's hash is only 32 bits on a 32-bit box, so even 2**16 keys (or
> 65 thousand) will give you a decent chance of a collision. In
> contrast MD5 needs 2**64, and a *good* hash needs 2**128 (SHA-256) or
> 2**256 (SHA-512). The two are at totally different extremes.
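The numbers quoted above follow from the birthday bound. A quick sketch
(the approximation formula is the standard one; the function name is
mine, not from the thread):

```python
import math

def collision_probability(n_keys, hash_bits):
    """Approximate birthday-bound probability that at least two of
    n_keys values collide in a space of 2**hash_bits buckets."""
    space = 2.0 ** hash_bits
    # Standard approximation: p ~ 1 - exp(-n(n-1) / (2m))
    return 1.0 - math.exp(-n_keys * (n_keys - 1) / (2.0 * space))

# 2**16 keys into a 32-bit hash: a very real chance of a collision.
print(round(collision_probability(2**16, 32), 2))   # -> 0.39
# The same number of keys against a 128-bit hash: negligible.
print(collision_probability(2**16, 128))
```

This is why a 32-bit dict hash and a cryptographic digest sit at
"totally different extremes": the collision probability scales with
n**2 / 2**bits, so every extra bit of digest halves the risk.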
I'm just going to go ahead and take the above as an admission by you
that the chance of collision is non-zero, and that if we accept that
fact you cannot rely on a hash function alone to tell you whether two
files are identical.
> There is *always* a non-zero chance of corruption, due to software
> bugs, hardware defects, or even operator error. It is only in that
> broader context that you can realize just how minuscule the risk is.
Please explain why you're dragging the notion of corruption into this
when it seems to be beside the point?
> Can you explain to me why you justify great lengths of paranoia, when
> the risk is so much lower?
Because in the real world, where I work, in practical, real, factual
terms I have seen it happen. Not once. Not twice. But many, many times.
>> Why are you advocating a solution to the OP's problem that is more
>> computationally expensive than a simple byte-by-byte comparison and
>> doesn't guarantee to give the correct answer?
> For single, one-off comparison I have no problem with a byte-by-byte
> comparison. There's a decent chance the files won't be in the OS's
> cache anyway, so disk IO will be your bottleneck.
> Only if you're doing multiple comparisons is a hash database
> justified. Even then, if you expect matching files to be fairly rare
> I won't lose any sleep if you're paranoid and do a byte-by-byte
> comparison anyway. New vulnerabilities are found, and if you don't
> update promptly there is a small (but significant) chance of a
> malicious file leading to collision.
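For the single, one-off case mentioned above, a byte-by-byte comparison
might look like the following minimal sketch. The chunked read keeps
memory use constant regardless of file size; the 64 KiB chunk size is
an arbitrary choice. (The standard library's
`filecmp.cmp(a, b, shallow=False)` does the same job.)

```python
import os

def files_identical(path_a, path_b, chunk_size=64 * 1024):
    """Return True iff the two files have identical contents."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False  # different sizes can never match
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False
            if not chunk_a:  # both streams exhausted: files match
                return True
```

The size check up front is cheap and lets most mismatches bail out
before any file data is read at all.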
If I have a number of files then I would certainly use a hash as a quick
test, but if two files hash to the same value I still have to go compare
them. Hashing in this case saves time doing a comparison for each file
but it doesn't obviate the need to do a byte-by-byte comparison to see
if the files that hash to the same value are actually the same.
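The scheme just described, with the hash as a quick pre-filter and a
byte-by-byte comparison as the final arbiter, can be sketched roughly
as follows (function and variable names are illustrative, not from the
thread; reading each whole file into memory for hashing is a
simplification):

```python
import filecmp
import hashlib

def find_duplicates(paths):
    """Group files by MD5 digest, then confirm each candidate pair
    with a full byte-by-byte comparison."""
    by_digest = {}
    for path in paths:
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        by_digest.setdefault(digest, []).append(path)

    duplicates = []
    for candidates in by_digest.values():
        # Equal digests only *suggest* equality; confirm each pair.
        for i, a in enumerate(candidates):
            for b in candidates[i + 1:]:
                if filecmp.cmp(a, b, shallow=False):
                    duplicates.append((a, b))
    return duplicates
```

The hash saves the O(n**2) pairwise comparisons across the whole
collection; the final `filecmp.cmp` call is what actually proves two
files are the same.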
> That's not my concern though. What I'm responding to is Nigel
> Rantor's grossly incorrect statements about probability. The chance
> of collision, in our life time, is *insignificant*.
Please tell me which statements? That the chance of two files hashing to
the same value is non-zero? You admit as much above.
Also, please remember I gave you a way of verifying what I said: go
crawl the web for images, pages, whatever, build a hash DB, and tell me
how long it takes you to find a collision using MD5 (which is the hash I
was talking about when I told you I had real-world experience to back
up the theoretical claim that collisions occur).