binary file compare...
tim.wintle at teamrubber.com
Fri Apr 17 13:30:49 CEST 2009
On Thu, 2009-04-16 at 21:44 -0700, Adam Olsen wrote:
> The Wayback Machine has 150 billion pages, so 2**37. Google's index
> is a bit larger at over a trillion pages, so 2**40. A little closer
> than I'd like, but that's still 562949950000000 to 1 odds of having
> *any* collisions between *any* of the files. Step up to SHA-256 and
> it becomes 191561940000000000000000000000000000000000000000000000 to
> 1. Sadly, I can't even give you the odds for SHA-512, Qalculate
> considers that too close to infinite to display. :)
That might be true as long as your data is completely uniformly
distributed. For the example you give there's:
a) a high chance that there's "<html>" near the top
b) a non-uniform distribution of individual words within the text.
c) a non-unifom distribution of all n-grams within the text (as there is
in natural language)
So it's very far from uniformly distributed. Just about the only
situation where I could imagine that holding would be where you are
hashing uniformly random data for the sake of testing the hash.
I believe the point being made is that comparing hash values is a
probabilistic algorithm anyway, which is fine if you're ok with that,
but for mission critical software it's crazy.
More information about the Python-list