[Spambayes] sharing wordlists - better numbers
T. Alexander Popiel
popiel at wolfskeep.com
Tue May 27 08:57:02 EDT 2003
In message: <16083.28342.289866.251818 at montanaro.dyndns.org>
Skip Montanaro <skip at pobox.com> writes:
>
> Brad> What can we say about these 12000 words?
>
>That they are common? ;-)
>
>How about the next step in the exercise? I propose that Alex and I
>(assuming Alex is amenable) each extract a non-hapax version of our word
>databases (real keys, real counts) then from that further extract a database
>from that of the most hammy and spammy words (how about <0.2 and >0.8?), run
>the usual tests against them and see how they do. If that looks promising,
>we can merge the common words from the two, test again, see how big the
>result is, then decide whether to include it in a later distribution.
I'm amenable, but I'm a bit short on time at the moment. If you
have the time, I can give you my entire database... otherwise it'll
likely wait until sometime this coming weekend.
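
For reference, the extraction step Skip describes might look roughly like the sketch below. It is only an illustration under stated assumptions: the word database is taken to be a plain dict mapping each word to a (spamcount, hamcount) pair, "non-hapax" is read as "seen more than once in total", and the names extract_strong_words and word_prob are hypothetical, not part of the Spambayes code. The scoring follows the arithmetic of the 'probability' method in 'classifier.py', without the optional imbalance adjustment, and the 0.45 and 0.5 defaults are assumed values for unknown_word_strength and unknown_word_prob.

```python
def word_prob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Spam probability of one word (hypothetical helper).

    s and x stand in for options.unknown_word_strength and
    options.unknown_word_prob; the defaults are assumptions.
    """
    nham = float(nham or 1)
    nspam = float(nspam or 1)
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    prob = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    # Pull the raw ratio toward x; the pull weakens as n grows.
    return (s * x + n * prob) / (s + n)


def extract_strong_words(counts, nspam, nham, low=0.2, high=0.8):
    """Keep non-hapax words whose probability is < low or > high.

    counts maps word -> (spamcount, hamcount); this layout is an
    assumption made for the sketch, not the real database format.
    """
    strong = {}
    for word, (spamcount, hamcount) in counts.items():
        if spamcount + hamcount <= 1:
            continue  # hapax: seen only once, so drop it
        p = word_prob(spamcount, hamcount, nspam, nham)
        if p < low or p > high:
            strong[word] = (spamcount, hamcount)
    return strong


if __name__ == "__main__":
    demo = {"viagra": (20, 0), "meeting": (0, 15),
            "the": (90, 95), "xyzzy": (1, 0)}
    print(sorted(extract_strong_words(demo, nspam=100, nham=100)))
```

With the made-up counts in the demo, "the" is dropped as too neutral and "xyzzy" is dropped as a hapax, leaving only the strongly hammy and spammy words.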
>What's the formula again for computing the ham/spam probability for a single
>word given its counts in spam and ham messages? I can never remember it and
>can't locate it in the source. Is it just the usual 0...1 sort of thing:
>
> 1 - nham/(nham+nspam)
Nope. Method 'probability' in 'classifier.py':

    spamcount = record.spamcount
    hamcount = record.hamcount

    nham = float(self.nham or 1)
    nspam = float(self.nspam or 1)

    assert hamcount <= nham
    hamratio = hamcount / nham

    assert spamcount <= nspam
    spamratio = spamcount / nspam

    prob = spamratio / (hamratio + spamratio)

    if options.experimental_ham_spam_imbalance_adjustment:
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0

    S = options.unknown_word_strength
    StimesX = S * options.unknown_word_prob

    n = hamcount * spam2ham + spamcount * ham2spam
    prob = (StimesX + n * prob) / (S + n)

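
To try the numbers outside the classifier, the same arithmetic can be wrapped in a standalone function. This is a sketch, not the shipped code: the function name and keyword arguments are hypothetical, the 0.45 and 0.5 defaults are assumed values for unknown_word_strength and unknown_word_prob, and the early return for a word with no counts at all is an addition of the sketch (the method above is only called on records that have counts).

```python
def word_spam_probability(spamcount, hamcount, nspam, nham,
                          unknown_word_strength=0.45,
                          unknown_word_prob=0.5,
                          imbalance_adjustment=False):
    """Standalone version of the per-word probability arithmetic.

    spamcount/hamcount: messages of each kind containing the word.
    nspam/nham: total messages of each kind that were trained on.
    """
    if spamcount == 0 and hamcount == 0:
        # Added for the sketch: no evidence, so return the prior.
        return unknown_word_prob

    nham = float(nham or 1)
    nspam = float(nspam or 1)

    # Fraction of ham and of spam messages containing the word.
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    prob = spamratio / (hamratio + spamratio)

    if imbalance_adjustment:
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0

    s = unknown_word_strength
    s_times_x = s * unknown_word_prob

    # Effective evidence count; more evidence means less pull
    # toward unknown_word_prob.
    n = hamcount * spam2ham + spamcount * ham2spam
    return (s_times_x + n * prob) / (s + n)


if __name__ == "__main__":
    print(word_spam_probability(10, 0, nspam=100, nham=100))
    print(word_spam_probability(5, 5, nspam=100, nham=100))
```

Under those assumed defaults, a word seen in 10 of 100 spam and in no ham scores about 0.978 rather than a flat 1.0, which is the main difference from the simple count ratio: the result is computed from per-corpus ratios and then shaded toward unknown_word_prob according to how much evidence there is.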
>? Also, what's the key in the database which stores the total spam and ham
>counts?
Urgh... I'm forgetting this one. In the pickle, they're stored
outside the wordlist, but I'm having trouble finding the non-pickle
version of the storage... ah, here it is: it's stored under the
"saved state" entry.
- Alex