[spambayes-dev] RE: [Spambayes] How low can you go?

Ryan Malayter rmalayter at bai.org
Thu Dec 18 18:57:11 EST 2003


{Seth Goodman}
> All your arguments on this point make lots of sense. 
> I'm a little surprised that you had significant 
> collisions mapping perhaps 100K items (my guess)
> into a 32-bit space.  I think that is rather dependent 
> on the hash used, but that's what you saw. 

That's not surprising at all to me. Because of the "birthday paradox",
even very input-sensitive (random-looking) hash functions like the
160-bit SHA-1 only give 80 bits of collision resistance. With a 32 bit
perfect hash, you get just 16 bits of collision resistance. That means
there is a 50% chance of a collision if you hash just 65,536 items. Hash
more items than that, and your chances of collision go up further.

If your hash function isn't perfectly (randomly) distributed in the
32-bit space, things could be much worse with 100,000 hashes in a
collection.

I would suggest using storing at least a 64 bit hash; perhaps the first
8 bytes of an SHA-1 or MD5 hash would be appropriate. There exists good
optimized code for both algorithms in the public domain.

Regards,
	Ryan



More information about the spambayes-dev mailing list