Tim Peters tim_one at
Wed Nov 24 23:09:25 CET 1999

[Thomas Weholt]
> Ok, so I`ve looked into zlib.crc32 and zlib.adler32. They seem easy
> enough to use, but I thought crc-codes had characters and numbers in
> them, not just a plain integer like the methods above return.

Bits is bits.  Whether you choose to view them as integers, or characters,
or little stick men struggling to avoid falling into little holes is all in
your head.  That is, don't worry about it!

> ( As you can see, I`m a complete ass on this subject,

In that case, I'll be toilet paper.

> but don`t have time to do the proper research myself, and was hoping
> for a "quick fix" ... )
> A friend of mine mentioned that I should try SHA-1 instead, for more
> accuracy. Can anybody give me an example on how to compute crc-codes,
> using zlib or preferrably some more accurate method, for single files ??
> If this is all it takes :
> crc = module_name.crc_method(file)
> and comparison is done like :
> if (crc1 == crc2): print "Equal."
> else : print "Different."
> then all I need is the name of the most effective/accurate module to
> use.

That's about it.  Strictly it should be:

if crc1 == crc2:
    print "Probably equal."
    print "Definitely different."

Equality of hash codes does not guarantee equality of files under any
method, it merely makes it likely.  Throw two random but unequal files at
one of the xxx32 methods, and there's (exactly) 1 chance in 2**32 (about 1
in 4 billion) that they'll return the same hash code.  The odds for MD5 are
about 1 in 2**128, and for SHA-1 about 1 in 2**160.

Like everything else, it's a tradeoff.  CRCs are cheap to compute, but very
easy to fool on purpose (it's trivial to construct any number of distinct
files that have the same CRC hashcode).  The other methods are expensive to
compute, but extremely difficult to fool on purpose (in fact, it's believed
to be computationally infeasible).  This reflects their design goals:  CRCs
were designed to catch transmission errors (accidents of nature), while MD5
and SHA-1 were designed to stop malicious people with large piles of money
(accidents of nature <wink>).

> If, for some strange reason, I should use one module instead of another,
> that info would be interesting too.

In practice, I like a two-step approach:

1. Compute CRCs.  If they're different, the files are definitely different,
   and you get out cheap.

2. Compute SHA-1.  If they're different too, the files are also definitely
   different.  If they're the same, then since the CRC and SHA-1 methods
   are unrelated, it's reasonable to expect that the odds of a false match
   are about 1 in 2**32 * 2**160 == 2**192 now.  This is even smaller than
   the chance Python2 will have curly-brace block delimeters.

which-in-turn-is-smaller-than-the-chance-you'll-wake-up-in-a-womb-ly y'rs
    - tim

More information about the Python-list mailing list