[Spambayes] sharing wordlists - better numbers

Skip Montanaro skip at pobox.com
Tue May 27 09:57:10 EDT 2003


    Brad> What can we say about these 12000 words? 

That they are common? ;-)

How about the next step in the exercise?  I propose that Alex and I
(assuming Alex is amenable) each extract a non-hapax version of our word
databases (real keys, real counts), then extract from that a database of
only the most hammy and spammy words (how about <0.2 and >0.8?), run the
usual tests against them and see how they do.  If that looks promising,
we can merge the common words from the two, test again, see how big the
result is, then decide whether to include it in a later distribution.
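
A rough sketch of the proposed steps, under several assumptions: each
database is modelled as a plain dict mapping word -> (nspam, nham), a
hapax is taken to be a word with a total count of 1, "merging the common
words" is read as keeping words present in both databases and summing
their counts, and the per-word score is a placeholder ratio rather than
whatever the classifier really computes.  All function names are
hypothetical.

```python
def spam_ratio(nspam, nham):
    """Placeholder per-word score (assumption): share of occurrences in spam."""
    return nspam / (nspam + nham)

def drop_hapaxes(db):
    """Keep only words seen more than once in total."""
    return {w: (s, h) for w, (s, h) in db.items() if s + h > 1}

def keep_extremes(db, low=0.2, high=0.8, score=spam_ratio):
    """Keep only strongly hammy (< low) or strongly spammy (> high) words."""
    return {w: (s, h) for w, (s, h) in db.items()
            if score(s, h) < low or score(s, h) > high}

def merge_common(db_a, db_b):
    """Keep words present in both databases, summing their counts."""
    return {w: (db_a[w][0] + db_b[w][0], db_a[w][1] + db_b[w][1])
            for w in db_a.keys() & db_b.keys()}

# Made-up example data, for illustration only.
skip_db = {"viagra": (40, 0), "python": (0, 25), "the": (50, 50), "xyzzy": (1, 0)}
alex_db = {"viagra": (10, 1), "python": (1, 30), "meeting": (0, 12)}

skip_small = keep_extremes(drop_hapaxes(skip_db))
alex_small = keep_extremes(drop_hapaxes(alex_db))
shared = merge_common(skip_small, alex_small)
```

The size of `shared` is the "how big is the result" number; the usual
tests would then be run against each reduced database and the merged one.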

What's the formula again for computing the ham/spam probability for a single
word given its counts in spam and ham messages?  I can never remember it and
can't locate it in the source.  Is it just the usual 0...1 sort of thing:

    1 - nham/(nham+nspam)

?  Also, what's the key in the database which stores the total spam and ham
counts?
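
For reference, the expression quoted above can be written out as below.
This is only a sketch of the formula as stated in this message, not
necessarily what the classifier computes; the function name and the
neutral 0.5 for an unseen word are assumptions.  Note that it uses raw
counts only, so it takes no account of how many ham and spam messages
were trained in total.

```python
def naive_spam_prob(nspam, nham):
    """1 - nham/(nham+nspam), i.e. the share of a word's occurrences in spam."""
    total = nham + nspam
    if total == 0:
        return 0.5  # assumption: treat a never-seen word as neutral
    return 1 - nham / total

print(naive_spam_prob(3, 1))   # word seen in 3 spam and 1 ham
print(naive_spam_prob(0, 5))   # word seen only in ham
```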

Thx,

Skip


