[Spambayes] RE: [spambayes-bugs] Spambayes repeatedly classifies
messages from mailing list as SPAM despite multiple (20+)
recoveries from spam folder
Meyer, Tony
T.A.Meyer at massey.ac.nz
Wed Sep 3 14:36:28 EDT 2003
> Sorry, I should have thought of that. I'm not really that
> surprised that the message gets flagged as spam initially -
> I'm just surprised that after a couple of weeks of "teaching"
> that this mailing list message hasn't been "whitelisted."
>
> Spam Score: 0.833733
> 'url:mydomain' 0.00493094 17079 0
[...]
> 'skip:g 10' 0.639491 176 1
> 'cannot' 0.643938 171 1
[...]
Do you have really unbalanced numbers of ham & spam? For example,
"cannot" is in 171 ham messages, but only 1 spam message - it really
shouldn't get a score of 0.64.
Spambayes works best trained with roughly equal numbers of ham & spam;
we're still trying to come up with a good method of working with
unbalanced training data. At the moment there is an option (defaults to
'on' in the Outlook plug-in) that adjusts the scores for unbalanced
mail. It looks like this is what is happening here - because of the
imbalance, a perfectly hammy word like "cannot" is getting a 0.64 score.
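
To illustrate why the training totals matter, here is a rough sketch of
a Robinson-style per-word score, where each count is taken relative to
the total number of ham or spam messages trained. It is not the
Spambayes source and does not model the experimental adjustment option;
the function name, the constants s=0.45 and x=0.5, and the training
totals are assumptions chosen for illustration. Only the counts for
"cannot" (171 ham, 1 spam) come from the clues above.

```python
def word_spam_prob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    """Estimate the spam probability of one token.

    hamcount/spamcount: messages of each kind containing the token
    (assumed to sum to at least 1).
    nham/nspam: total ham and spam messages trained (both > 0).
    s, x: assumed strength and prior for a never-seen word.
    """
    # Counts are normalised by the size of each training set, so a
    # single spam hit weighs a lot when very little spam was trained.
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    prob = spamratio / (hamratio + spamratio)

    # Pull the raw estimate toward the prior x; the more evidence
    # (n), the less the prior matters.
    n = hamcount + spamcount
    return (s * x + n * prob) / (s + n)


# "cannot": 171 ham, 1 spam. The totals below are hypothetical.
balanced = word_spam_prob(171, 1, nham=1000, nspam=1000)
skewed = word_spam_prob(171, 1, nham=20000, nspam=50)
print(f"balanced: {balanced:.3f}  skewed: {skewed:.3f}")
```

With equal totals the word scores close to 0; with far more ham than
spam trained, the same counts give a score well above 0.5.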
Two suggestions:
o Try disabling the experimental_ham_spam_imbalance_adjustment option.
You can do this by changing the default_bayes_customize.ini file in your
data folder (see the FAQ for the location). You don't need to retrain
to see the effects of this, so have a look at that same message with the
option off and see what the score is. Note that you might get more
false-negatives, though.
o Try retraining with roughly equal numbers of ham and spam (just take
a random selection of the bigger collection) and see how that works.
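
For the second suggestion, a minimal sketch of cutting the bigger
collection down to the size of the smaller one by random selection. The
function name and the message lists are hypothetical; how you get the
messages out of your folders and train on them is not shown.

```python
import random


def balance_training_sets(ham, spam, seed=None):
    """Return (ham, spam) lists trimmed to the same length.

    The larger collection is reduced by sampling without replacement;
    the smaller one is returned unchanged (as a copy).
    """
    rng = random.Random(seed)
    target = min(len(ham), len(spam))
    if len(ham) > target:
        ham = rng.sample(ham, target)
    if len(spam) > target:
        spam = rng.sample(spam, target)
    return list(ham), list(spam)


# Hypothetical message identifiers standing in for real messages.
all_ham = [f"ham-{i}" for i in range(17000)]
all_spam = [f"spam-{i}" for i in range(180)]
train_ham, train_spam = balance_training_sets(all_ham, all_spam, seed=1)
print(len(train_ham), len(train_spam))
```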
=Tony Meyer