[Spambayes] Seeking a giant idle machine w/ a miserable corpus

Tim Peters tim.one@comcast.net
Mon Nov 18 06:48:54 2002


[Rob Hooft]
> I collected some unigram statistics yesterday: training hammie on my
> 2x10 sets in the corpus one by one, and after each 1600ham+580spam set
> run a program that reports the would-be collisions using the python 32
> bit hash function:
>
> Set1 : 109280
> Set2 : 183560
> Set3 : 227699 (2 clashes)
> Set4 : 277253 (3 clashes)
> Set5 : 329662 (5)
> Set6 : 362847 (7)
> Set7 : 394585 (12)
> Set8 : 422898 (12)
> Set9 : 448767 (16)
> Set10: 481393 (22)

I'm assuming the "big numbers" there are the number of distinct tokens,
rather than the number of distinct 32-bit hash codes.  If so, they're all a
little better than could be expected from a truly random 32-bit hash code.

BTW, after tossing N balls into M buckets, the expected # of occupied
buckets has

mean       M - M*(1-1/M)**N
variance   M*(M-1)*(1-2/M)**N + M*(1-1/M)**N - M**2*(1-1/M)**(2*N)

Unfortunately, those expressions are numerically intractable using double
precision when M and N get large.  The exact distribution is intractable
even in theory; Knuth gives an iterative algorithm for computing it given
specific M and N, which takes time super-linear in N.  I have software left
over for this stuff from Python's years-ago experiments.

> ...
> Here, the number of hash collisions is still fairly low, but subtract
> bits, and see it explode.....

Oh yes!  The biggest pickle I've got sitting around has 327,439 tokens.
Using the last 20 bits of the hash code means using 2**20 ~= a million
buckets, and the mean number of collisions then is expected to be 46193.8,
with sdev 174.5 (nb: it's not a normal distribution).  Actually doing this
gave

    46,184  collisions using the low 20 bits of hash()
    46,481  collisions using the low 20 bits of binascii.crc32()

So either way is "random enough" for this range of numbers.

> Another thing that I learned from this, is that the number of distinct
> words with this test does not increase with the sqrt of the number of
> messages.

Perhaps not, but we're going to pretend that it is anyway because that's
such a pretty & quotable result <wink>.




More information about the Spambayes mailing list