BUG? sha-moduel returns same crc for different files

jepler epler jepler.lnk at lnk.ispi.net
Sun Sep 17 22:44:52 EDT 2000


On Sun, 17 Sep 2000 20:06:23 -0400, alex_holubec
 <alex_holubec at email.msn.com> wrote:
>Another Solution.
>Hashing is an overkill. The following method has been used by Herr Professor
>Doktor Niklaus Wirth :-)
>in Project Oberon: open both files in binary mode and compare byte by byte.
>Very simple and fast.

Well, the sha or md5 hash helps you when you want to perform many
comparisons --- for instance, consider the problem "of these 100 files,
tell me which (if any) are identical.  You're facing 100^2 "file compares"
for a naive nested-for-loop implementation, and 100 log 100 "file compares"
for a smart implementation.

Now, if you could compare 16 or 20 byte tokens instead of 100k-1M files,
you've just trimmed the constant factor of your O(N log N) time by a factor of
5000 to 50000, while adding in an O(N) term which is negligible compared to
N log N when N grows large.

Another reason to compare tokens is because you might not wish to keep the
entire file around.  For instance, imagine that you wanted to know that some
document is different from any one you've had before, but that you don't have
sufficient storage to keep all the documents around.  Well, just store the hash
of any document in some database when you get it, and then compare the hash of
the new file to each of the hashes in the database.  Not only have you trimmed
storage by a factor of 5000 to 50000, you have also reduced the runtime by
a similar factor.

Let's assume the original poster had a good reason, like the above, to choose
to use hash functions.  I hope we can discover why he's still having problems.

Jeff



More information about the Python-list mailing list