md5 and large files

Jeff Epler jepler at
Sun Oct 17 18:55:06 CEST 2004

It seems quite likely that two different files would share the same 4k "preamble".

For instance, a unix tar file containing a 16k "file1" and then a 1k
"file2" would have the same leading bytes as a unix tar file containing
a 16k "file1" and a 1k "file3", and therefore the md5sum over the first
4k would match. (These two tar files would also have the same byte length.)
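
Here is a rough sketch of the tar case, in case anyone wants to try it. The helper name make_tar and the file contents are made up for illustration, and I'm assuming plain uncompressed ustar archives built in memory with default (zero) timestamps:

```python
import hashlib
import io
import tarfile

def make_tar(members):
    """Build an uncompressed tar archive in memory from (name, data) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in members:
            info = tarfile.TarInfo(name)   # mtime, uid, gid keep their zero defaults
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# Hypothetical contents: a 16k "file1" shared by both archives, then a 1k second file.
file1 = b"a" * 16384
tar_a = make_tar([("file1", file1), ("file2", b"b" * 1024)])
tar_b = make_tar([("file1", file1), ("file3", b"c" * 1024)])

# Hash only the first 4k of each archive, then the whole archive.
prefix_a = hashlib.md5(tar_a[:4096]).hexdigest()
prefix_b = hashlib.md5(tar_b[:4096]).hexdigest()
full_a = hashlib.md5(tar_a).hexdigest()
full_b = hashlib.md5(tar_b).hexdigest()

print("same length:      ", len(tar_a) == len(tar_b))
print("4k prefixes match:", prefix_a == prefix_b)
print("full hashes match:", full_a == full_b)
```

The first 4k of each archive is just the header for file1 plus the start of its data, so a prefix-only hash cannot tell the two archives apart.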

Likewise, if all pages on some website begin with
        <SCRIPT> pages and pages of javascript here (at least 4k) </SCRIPT>
        <TITLE> ...
the initial 4k might match, too.

But anyway, if s1 != s2, then the odds that hash(s1) == hash(s2) should
be small, and that shouldn't depend on the length of the string.
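
So hashing the whole file is the safer comparison, and it need not use much memory. A minimal sketch, assuming the files are on local disk; the function name md5_of_file and the 64k chunk size are my own choices, not anything from the earlier posts:

```python
import hashlib

def md5_of_file(path, chunk_size=64 * 1024):
    """Return the hex MD5 digest of the whole file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # f.read() returns b"" at end of file, which stops the iteration.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Feeding the data to update() in pieces gives the same digest as hashing it all at once, so only one chunk has to be in memory at a time, however large the file is.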
