binary file compare...

Adam Olsen rhamph at
Thu Apr 16 11:44:06 CEST 2009

On Apr 16, 3:16 am, Nigel Rantor <wig... at> wrote:
> Adam Olsen wrote:
> > On Apr 15, 12:56 pm, Nigel Rantor <wig... at> wrote:
> >> Adam Olsen wrote:
> >>> The chance of *accidentally* producing a collision, although
> >>> technically possible, is so extraordinarily rare that it's completely
> >>> overshadowed by the risk of a hardware or software failure producing
> >>> an incorrect result.
> >> Not when you're using them to compare lots of files.
> >> Trust me. Been there, done that, got the t-shirt.
> >> Using hash functions to tell whether or not files are identical is an
> >> error waiting to happen.
> >> But please, do so if it makes you feel happy, you'll just eventually get
> >> an incorrect result and not know it.
> > Please tell us what hash you used and provide the two files that
> > collided.
> MD5
> > If your hash is 256 bits, then you need around 2**128 files to produce
> > a collision.  This is known as a Birthday Attack.  I seriously doubt
> > you had that many files, which suggests something else went wrong.
> Okay, before I tell you about the empirical, real-world evidence I have
> could you please accept that hashes collide and that no matter how many
> samples you use the probability of finding two files that do collide is
> small but not zero.

I'm afraid you will need to back up your claims with real files.
MD5 is a smaller, older hash (128 bits, so by the birthday bound you
need on the order of 2**64 files before a collision becomes likely),
and it has substantial known vulnerabilities.  Even so, the scenario
you suggest, where you *accidentally* find collisions (and you imply
multiple collisions!), would be a rather significant finding.
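
To put rough numbers on that, here is a minimal sketch of the usual
birthday approximation.  It assumes the hash output is uniformly
random, which only holds for non-malicious input, and the function
name collision_probability is just an illustrative choice of mine.

```python
import math

def collision_probability(num_files, hash_bits):
    """Approximate chance that at least two of num_files hashes collide.

    Assumes hash values are uniformly random over 2**hash_bits outputs.
    """
    # Birthday approximation: p ~= 1 - exp(-n*(n-1) / 2**(bits+1))
    exponent = -num_files * (num_files - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small
    return -math.expm1(exponent)
```

With these assumptions, collision_probability(2**64, 128) comes out
near 0.39, while a billion files under a 128-bit hash gives a figure
well below 1e-20.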

Please help us all by justifying your claim.

Mind you, since you use MD5 I wouldn't be surprised if your files were
maliciously produced.  As I said before, you need to consider
upgrading your hash every few years to avoid new attacks.
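
If you want the speed of hashing without having to trust it blindly,
one option is to treat a hash match as a hint and confirm it byte for
byte.  The following is only a sketch under my own assumptions: the
helper names file_digest and same_contents are hypothetical, and
SHA-256 is simply one newer hash available in hashlib.

```python
import filecmp
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files need not fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_contents(path_a, path_b):
    """True only if both files match by hash AND byte for byte."""
    if file_digest(path_a) != file_digest(path_b):
        return False  # different digests always mean different contents
    # Equal digests: confirm with a full comparison (shallow=False
    # makes filecmp compare the actual file contents)
    return filecmp.cmp(path_a, path_b, shallow=False)
```

The extra comparison only runs for files whose digests already match,
so it costs little when most files differ.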
