binary file compare...

Piet van Oostrum piet at
Sat Apr 18 10:19:21 CEST 2009

>>>>> Adam Olsen <rhamph at> (AO) wrote:

>AO> The Wayback Machine has 150 billion pages, so 2**37.  Google's index
>AO> is a bit larger at over a trillion pages, so 2**40.  A little closer
>AO> than I'd like, but that's still 562949950000000 to 1 odds of having
>AO> *any* collisions between *any* of the files.  Step up to SHA-256 and
>AO> it becomes 191561940000000000000000000000000000000000000000000000 to
>AO> 1.  Sadly, I can't even give you the odds for SHA-512, Qalculate
>AO> considers that too close to infinite to display. :)

>AO> You should worry more about your head spontaneously exploding than you
>AO> should about a hash collision on that scale.  To do otherwise is
>AO> irrational paranoia.
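
For reference, the quoted odds are consistent with the usual birthday approximation, where the chance of any collision among n random b-bit hashes is roughly n**2 / 2**(b+1). A rough sketch of that arithmetic follows; the function name is made up for illustration, and the approximation only holds while the result is much larger than 1:

```python
def collision_odds(num_files: int, hash_bits: int) -> int:
    """Approximate 'X to 1' odds against any collision among num_files hashes.

    Assumes uniformly random hashes and uses the birthday approximation
    p ~= n**2 / 2**(bits + 1), so the odds are about 2**(bits + 1) / n**2.
    """
    return 2 ** (hash_bits + 1) // num_files ** 2


# A trillion pages is taken as 2**40, as in the quoted text.
print(collision_odds(2 ** 40, 128))   # MD5-sized hash
print(collision_odds(2 ** 40, 256))   # SHA-256-sized hash
```

With 2**40 files this gives 2**49 for a 128-bit hash and 2**177 for a 256-bit hash, which matches the rounded figures quoted above.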

And that is the probability of there being two files in that huge
collection with the same hash. If you just take two files, not
fabricated to collide, the probability of them having the same hash
under MD5 is 2**-128, which I think is far smaller than the probability
of the bit representing the answer being flipped by some physical cause
in your computer. But then again, it doesn't make sense to use a hash
instead of byte-by-byte comparison if both files are on the same machine.
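
A minimal sketch of such a byte-by-byte comparison, assuming both paths are regular files on the local machine (the function name and chunk size are only illustrative choices):

```python
import os


def same_contents(path_a, path_b, chunk_size=64 * 1024):
    """Return True if the two files have identical bytes."""
    # Files of different sizes can never be equal, so skip reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False      # first difference found, stop early
            if not chunk_a:
                return True       # both files ended without a difference
```

The standard library offers the same thing as filecmp.cmp(path_a, path_b, shallow=False). Either way the comparison can stop at the first differing chunk, whereas hashing has to read both files completely.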
Piet van Oostrum <piet at>
URL: [PGP 8DAE142BE17999C4]
Private email: piet at
