binary file compare...
Grant Edwards
invalid at invalid
Thu Apr 16 10:59:19 EDT 2009
On 2009-04-16, Adam Olsen <rhamph at gmail.com> wrote:
>>>>> The chance of *accidentally* producing a collision, although
>>>>> technically possible, is so extraordinarily rare that it's
>>>>> completely overshadowed by the risk of a hardware or software
>>>>> failure producing an incorrect result.
>>>>
>>>> Not when you're using them to compare lots of files.
>>
>>>> Trust me. Been there, done that, got the t-shirt.
I must admit, that is a bit hard to believe.
>> Okay, before I tell you about the empirical, real-world
>> evidence I have could you please accept that hashes collide
>> and that no matter how many samples you use the probability of
>> finding two files that do collide is small but not zero.
>
> I'm afraid you will need to back up your claims with real files.
> Although MD5 is a smaller, older hash (128 bits, so you only need
> 2**64 files to find collisions),
You don't need quite that many to have a significant chance of
a collision. With "only" something on the order of 2**61
files, you still have about a 1% chance of a collision.
Just for fun here are a few data points showing the probability
of at least one collision in a sample of size n using a perfect
128-bit hash. These were calculated using the taylor-series
approximation shown at
http://en.wikipedia.org/wiki/Birthday_paradox#Calculating_the_probability
For "a few million files" (we'll say 4e6), the probability of a
collision is so close to 0 that it can't be calculated using
double-precision IEEE floats.
For a few _billion_ files, the taylor series approximation is
still 0 to about 15 significant digits.
For a few _trillion_ files, you finally get a "non-zero"
probability of about 2e-14.
Here are a few more points:
n = 1806791225998070; p(n) = 0.00000000
n = 1987470348597877; p(n) = 0.00000001
n = 2186217383457665; p(n) = 0.00000001
n = 2404839121803431; p(n) = 0.00000001
n = 2645323033983774; p(n) = 0.00000001
n = 2909855337382151; p(n) = 0.00000001
n = 3200840871120366; p(n) = 0.00000002
n = 3520924958232403; p(n) = 0.00000002
n = 3873017454055643; p(n) = 0.00000002
n = 4260319199461207; p(n) = 0.00000003
n = 4686351119407328; p(n) = 0.00000003
n = 5154986231348061; p(n) = 0.00000004
n = 5670484854482868; p(n) = 0.00000005
n = 6237533339931155; p(n) = 0.00000006
n = 6861286673924271; p(n) = 0.00000007
n = 7547415341316699; p(n) = 0.00000008
n = 8302156875448370; p(n) = 0.00000010
n = 9132372562993208; p(n) = 0.00000012
n = 10045609819292530; p(n) = 0.00000015
n = 11050170801221784; p(n) = 0.00000018
n = 12155187881343964; p(n) = 0.00000022
n = 13370706669478362; p(n) = 0.00000026
n = 14707777336426200; p(n) = 0.00000032
n = 16178555070068822; p(n) = 0.00000038
n = 17796410577075706; p(n) = 0.00000047
n = 19576051634783280; p(n) = 0.00000056
n = 21533656798261608; p(n) = 0.00000068
n = 23687022478087772; p(n) = 0.00000082
n = 26055724725896552; p(n) = 0.00000100
n = 28661297198486208; p(n) = 0.00000121
n = 31527426918334832; p(n) = 0.00000146
n = 34680169610168320; p(n) = 0.00000177
n = 38148186571185152; p(n) = 0.00000214
n = 41963005228303672; p(n) = 0.00000259
n = 46159305751134040; p(n) = 0.00000313
n = 50775236326247448; p(n) = 0.00000379
n = 55852759958872200; p(n) = 0.00000458
n = 61438035954759424; p(n) = 0.00000555
n = 67581839550235368; p(n) = 0.00000671
n = 74340023505258912; p(n) = 0.00000812
n = 81774025855784816; p(n) = 0.00000983
n = 89951428441363312; p(n) = 0.00001189
n = 98946571285499648; p(n) = 0.00001439
n = 108841228414049616; p(n) = 0.00001741
n = 119725351255454592; p(n) = 0.00002106
n = 131697886381000064; p(n) = 0.00002548
n = 144867675019100096; p(n) = 0.00003084
n = 159354442521010112; p(n) = 0.00003731
n = 175289886773111136; p(n) = 0.00004515
n = 192818875450422272; p(n) = 0.00005463
n = 212100762995464512; p(n) = 0.00006610
n = 233310839295010976; p(n) = 0.00007998
n = 256641923224512096; p(n) = 0.00009678
n = 282306115546963328; p(n) = 0.00011710
n = 310536727101659712; p(n) = 0.00014169
n = 341590399811825728; p(n) = 0.00017144
n = 375749439793008320; p(n) = 0.00020744
n = 413324383772309184; p(n) = 0.00025099
n = 454656822149540160; p(n) = 0.00030369
n = 500122504364494208; p(n) = 0.00036745
n = 550134754800943680; p(n) = 0.00044460
n = 605148230281038080; p(n) = 0.00053794
n = 665663053309141888; p(n) = 0.00065088
n = 732229358640056192; p(n) = 0.00078751
n = 805452294504061824; p(n) = 0.00095280
n = 885997523954468096; p(n) = 0.00115278
n = 974597276349915008; p(n) = 0.00139469
n = 1072057003984906624; p(n) = 0.00168733
n = 1179262704383397376; p(n) = 0.00204131
n = 1297188974821737216; p(n) = 0.00246945
n = 1426907872303911168; p(n) = 0.00298726
n = 1569598659534302464; p(n) = 0.00361345
n = 1726558525487732736; p(n) = 0.00437061
n = 1899214378036506112; p(n) = 0.00528601
n = 2089135815840156928; p(n) = 0.00639252
n = 2298049397424172800; p(n) = 0.00772975
n = 2527854337166590464; p(n) = 0.00934539
n = 2780639770883249664; p(n) = 0.01129680
Here's the Python function I'm using:
def bp(n, d):
return 1.0 - exp(-n*(n-1.)/(2.*d))
I haven't spent much time studying the numerical issues of the
way that the exponent is calculated, so I'm not entirely
confident in the results for "small" n values such that
p(n) == 0.0.
--
Grant Edwards grante Yow! ... the MYSTERIANS are
at in here with my CORDUROY
visi.com SOAP DISH!!
More information about the Python-list
mailing list