[Spambayes] sharing wordlists - better numbers

T. Alexander Popiel popiel at wolfskeep.com
Tue May 27 08:57:02 EDT 2003


In message:  <16083.28342.289866.251818 at montanaro.dyndns.org>
             Skip Montanaro <skip at pobox.com> writes:
>
>    Brad> What can we say about these 12000 words? 
>
>That they are common? ;-)
>
>How about the next step in the exercise?  I propose that Alex and I
>(assuming Alex is amenable) each extract a non-hapax version of our word
>databases (real keys, real counts) then from that further extract a database
>from that of the most hammy and spammy words (how about <0.2 and >0.8?), run
>the usual tests against them and see how they do.  If that looks promising,
>we can merge the common words from the two, test again, see how big the
>result is, then decide whether to include it in a later distribution.

I'm amenable, but I'm a bit short on time at the moment.  If you
have the time, I can give you my entire database... otherwise it'll
likely wait until sometime this coming weekend.
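
For reference, the extraction step Skip proposes could be sketched roughly
as below. This is only an illustration, not code from the spambayes tree:
the function name, the (spamcount, hamcount) tuple layout, and the prob_of
callback are all hypothetical, and "hapax" is assumed to mean a word seen
exactly once across both corpora.

```python
def extract_strong_words(counts, prob_of, low=0.2, high=0.8):
    """Keep non-hapax words whose spam probability is strongly hammy or spammy.

    counts  -- dict mapping word -> (spamcount, hamcount)   (assumed layout)
    prob_of -- callable (spamcount, hamcount) -> probability in [0, 1]
    low, high -- thresholds from the proposal (<0.2 hammy, >0.8 spammy)
    """
    strong = {}
    for word, (spamcount, hamcount) in counts.items():
        # Skip hapaxes: words seen only once in total.
        if spamcount + hamcount <= 1:
            continue
        p = prob_of(spamcount, hamcount)
        # Keep only words far from the neutral middle.
        if p < low or p > high:
            strong[word] = (spamcount, hamcount)
    return strong
```

Any per-word probability function can be passed as prob_of, so the same
filter works with whatever formula the classifier actually uses.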

>What's the formula again for computing the ham/spam probability for a single
>word given its counts in spam and ham messages?  I can never remember it and
>can't locate it in the source.  Is it just the usual 0...1 sort of thing:
>
>    1 - nham/(nham+nspam)

Nope.  Method 'probability' in 'classifier.py':

        spamcount = record.spamcount
        hamcount = record.hamcount
       
        nham = float(self.nham or 1)
        nspam = float(self.nspam or 1)

        assert hamcount <= nham
        hamratio = hamcount / nham

        assert spamcount <= nspam
        spamratio = spamcount / nspam

        prob = spamratio / (hamratio + spamratio)

        if options.experimental_ham_spam_imbalance_adjustment:
            spam2ham = min(nspam / nham, 1.0)
            ham2spam = min(nham / nspam, 1.0)
        else:
            spam2ham = ham2spam = 1.0

        S = options.unknown_word_strength
        StimesX = S * options.unknown_word_prob

        n = hamcount * spam2ham  +  spamcount * ham2spam
        prob = (StimesX + n * prob) / (S + n)
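
To try the calculation outside the classifier, here is a standalone sketch
of the same arithmetic. It is a paraphrase of the excerpt above, not the
actual method: the function name and keyword arguments are made up for
illustration, the defaults of 0.45 and 0.5 are what I believe the option
defaults to be (check your own options), and the guard for a word with no
counts at all is my addition.

```python
def word_spam_probability(spamcount, hamcount, nspam, nham,
                          unknown_word_strength=0.45,   # assumed default
                          unknown_word_prob=0.5,        # assumed default
                          imbalance_adjustment=False):
    """Standalone paraphrase of the per-word probability calculation."""
    # Avoid dividing by zero when nothing has been trained yet.
    nham = float(nham or 1)
    nspam = float(nspam or 1)

    assert hamcount <= nham
    assert spamcount <= nspam

    # Fraction of ham (spam) messages that contain the word.
    hamratio = hamcount / nham
    spamratio = spamcount / nspam

    if hamratio + spamratio == 0:
        # Not in the excerpt: a word never seen gets the prior.
        return unknown_word_prob

    prob = spamratio / (hamratio + spamratio)

    if imbalance_adjustment:
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0

    # Pull the raw estimate toward the prior; the pull weakens as the
    # word's (possibly adjusted) count n grows.
    S = unknown_word_strength
    StimesX = S * unknown_word_prob
    n = hamcount * spam2ham + spamcount * ham2spam
    return (StimesX + n * prob) / (S + n)
```

Note how this differs from a plain count ratio: a word seen in 10 of 1000
spam and 10 of 100 ham would score 0.5 by raw counts, but comes out near
0.1 here because the counts are first normalized by corpus size.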


>?  Also, what's the key in the database which stores the total spam and ham
>counts?

Urgh... I'm forgetting this one.  In the pickle, they're stored
outside the wordlist, but I'm having trouble finding the non-pickle
version of the storage... ah, here it is: it's stored under the
"saved state" entry.

- Alex


