binary file compare...

Nigel Rantor wiggly at wiggly.org
Thu Apr 16 12:12:57 CEST 2009


Adam Olsen wrote:
> On Apr 16, 3:16 am, Nigel Rantor <wig... at wiggly.org> wrote:
>> Adam Olsen wrote:
>>> On Apr 15, 12:56 pm, Nigel Rantor <wig... at wiggly.org> wrote:
>>>> Adam Olsen wrote:
>>>>> The chance of *accidentally* producing a collision, although
>>>>> technically possible, is so extraordinarily rare that it's completely
>>>>> overshadowed by the risk of a hardware or software failure producing
>>>>> an incorrect result.
>>>> Not when you're using them to compare lots of files.
>>>> Trust me. Been there, done that, got the t-shirt.
>>>> Using hash functions to tell whether or not files are identical is an
>>>> error waiting to happen.
>>>> But please, do so if it makes you feel happy, you'll just eventually get
>>>> an incorrect result and not know it.
>>> Please tell us what hash you used and provide the two files that
>>> collided.
>> MD5
>>
>>> If your hash is 256 bits, then you need around 2**128 files to produce
>>> a collision.  This is known as a Birthday Attack.  I seriously doubt
>>> you had that many files, which suggests something else went wrong.
>> Okay, before I tell you about the empirical, real-world evidence I have
>> could you please accept that hashes collide and that no matter how many
>> samples you use the probability of finding two files that do collide is
>> small but not zero.
> 
> I'm afraid you will need to back up your claims with real files.
> Although MD5 is a smaller, older hash (128 bits, so you only need
> 2**64 files to find collisions), and it has substantial known
> vulnerabilities, the scenario you suggest where you *accidentally*
> find collisions (and you imply multiple collisions!) would be a rather
> significant finding.

No. It wouldn't. It isn't.

The files in question were millions of audio files. I no longer work at 
the company where I had access to them so I cannot give you examples, 
and even if I did Data Protection regulations wouldn't have allowed it.

If you still don't believe me you can easily verify what I'm saying by 
doing some simple experiments. Go spider the web for images and keep 
collecting them until you get an MD5 hash collision.

It won't take long.

> Please help us all by justifying your claim.

Now, please go and re-read my request first and admit that everything I 
have said so far is correct.

> Mind you, since you use MD5 I wouldn't be surprised if your files were
> maliciously produced.  As I said before, you need to consider
> upgrading your hash every few years to avoid new attacks.

Good grief, this is nothing to do with security concerns, this is about 
someone suggesting to the OP that they use a hash function to determine 
whether or not two files are identical.
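
For the OP's case, a direct byte-by-byte comparison sidesteps the hash question entirely. A minimal sketch, assuming two ordinary files on disk; the function name `files_identical` and the chunk size are illustrative choices, not something from this thread:

```python
import os

def files_identical(path_a, path_b, chunk_size=64 * 1024):
    """Return True only if both files contain exactly the same bytes."""
    # Files of different sizes can never be identical, so skip reading them.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # first differing chunk settles it
            if not chunk_a:
                return True   # both files reached EOF with no difference
```

The standard library's filecmp.cmp(path_a, path_b, shallow=False) does much the same thing. Comparing many files pairwise this way gets expensive, so grouping by size first keeps the number of full comparisons down.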

Regards,

   Nige


