BUG? sha-module returns same crc for different files

Treutwein Guido Guido.Treutwein at nbg.siemens.de
Mon Sep 18 09:16:11 EDT 2000


jepler epler wrote:

> On Sun, 17 Sep 2000 20:06:23 -0400, alex_holubec
>  <alex_holubec at email.msn.com> wrote:
> >Another Solution.
> >Hashing is an overkill. The following method has been used by Herr Professor
> >Doktor Niklaus Wirth :-)
> >in Project Oberon: open both files in binary mode and compare byte by byte.
> >Very simple and fast.

>
> Let's assume the original poster had a good reason, like the above, to choose
> to use hash functions.  I hope we can discover why he's still having problems.
>
> Jeff

OK, let's assume it is NOT an EOF effect and the whole file is fed correctly into
SHA-1.
a) A hash code is a mapping into a fixed-length string of 160 bits (SHA-1) or
128 bits (MD5).
b) Nothing prevents files with different lengths from having the same hash code.
c) While the probability that a file has one particular hash code is
2^(-hash_bitlength), the probability of finding two files with the same hash value
among many files is MUCH bigger; this is the so-called birthday paradox (named for
the fact that with 23 people in a room, the probability that two of them share a
birthday is better than 50%; for a 32-bit CRC the corresponding limit is about
77000 files for a 50% chance). For this reason, standardization bodies are moving
towards larger hash sizes like 256 bits.
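
The numbers in c) can be checked with the usual birthday approximation
n ~ sqrt(2 * N * ln(1 / (1 - p))), where N is the number of possible hash values
and p is the desired collision probability. The following is only a rough sketch:
the function name birthday_threshold is made up for illustration, and the formula
assumes hash values are uniformly distributed.

```python
import math

def birthday_threshold(space, p=0.5):
    """Approximate number of random items needed before the chance of at
    least one collision among `space` equally likely values reaches p."""
    return math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p)))

# Birthdays, a 32-bit CRC, MD5 (128 bits) and SHA-1 (160 bits).
for label, space in (("365 days", 365),
                     ("32 bits", 2 ** 32),
                     ("128 bits", 2 ** 128),
                     ("160 bits", 2 ** 160)):
    print("%-9s %.3g" % (label, birthday_threshold(space)))
```

For 365 days this gives about 22.5 (hence 23 people), and for 32 bits about 77000.
For 160 bits the same approximation gives on the order of 10^24 files.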

Suggestions:
Verify with a different implementation that the SHA-1 results for these two files
are correct.
Don't simply switch to MD5, since the probability of a hash value collision will
*rise* due to the smaller hash size.
The hash functions mentioned are relatively slow because of their cryptographic
requirements (it must be difficult to find an input for a given output, etc.).
These requirements do not apply to the given application, so a faster and less
complicated function with a bigger result size could be used.
If you don't have the time to write a hash function as a C extension package,
consider using a combination of SHA-1, MD5, file size and CRC32.

Guido


