[Spambayes] RE: [spambayes-bugs] Spambayes repeatedly classifies messages from mailing list as SPAM despite multiple (20+) recoveries from spam folder

Meyer, Tony T.A.Meyer at massey.ac.nz
Wed Sep 3 14:36:28 EDT 2003


> Sorry, I should have thought of that.  I'm not really that 
> surprised that the message gets flagged as spam initially - 
> I'm just surprised that after a couple of weeks of "teaching" 
> that this mailing list message hasn't been "whitelisted."
> 
> Spam Score: 0.833733
> 'url:mydomain'                         0.00493094      17079      0
[...]
> 'skip:g 10'                         0.639491          176      1
> 'cannot'                            0.643938          171      1
[...]

Do you have really unbalanced numbers of ham & spam?  For example,
"cannot" is in 171 ham messages, but only 1 spam message - it really
shouldn't get a score of 0.64.

Spambayes works best trained with roughly equal numbers of ham & spam;
we're still trying to come up with a good method of working with
unbalanced training data.  At the moment there is an option (defaults to
'on' in the Outlook plug-in) that adjusts the scores for unbalanced
mail.  It looks like this is what is happening here - because of the
imbalance, a perfectly hammy word like "cannot" is getting a 0.64 score.
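To make the effect concrete, here is a rough sketch of a Robinson-style per-word score with an optional imbalance adjustment. It is written from memory of the general approach and is not the actual Spambayes source. The function name, the prior values (s=0.45, x=0.5), and the training totals below are all assumptions for illustration; the original message does not give the real totals.

```python
def word_spam_prob(ham_count, spam_count, n_ham, n_spam,
                   s=0.45, x=0.5, imbalance_adjustment=False):
    """Sketch of a Robinson-style spam probability for a single word.

    ham_count / spam_count: messages of each kind containing the word.
    n_ham / n_spam: total trained messages of each kind.
    s, x: strength and value of the prior for unseen words (assumed values).
    """
    if ham_count == 0 and spam_count == 0:
        return x  # no evidence at all: fall back to the prior

    # Counts are compared as fractions of each corpus, not as raw counts.
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam
    prob = spam_ratio / (ham_ratio + spam_ratio)

    if imbalance_adjustment:
        # Shrink the evidence from the over-represented class.
        spam2ham = min(n_spam / n_ham, 1.0)
        ham2spam = min(n_ham / n_spam, 1.0)
    else:
        spam2ham = ham2spam = 1.0

    # n is the effective amount of evidence behind 'prob'.
    n = ham_count * spam2ham + spam_count * ham2spam
    return (s * x + n * prob) / (s + n)


# A word seen in 171 ham and 1 spam, under two hypothetical sets of totals.
balanced = word_spam_prob(171, 1, n_ham=1000, n_spam=1000)
unbalanced = word_spam_prob(171, 1, n_ham=20000, n_spam=100)
unbalanced_adj = word_spam_prob(171, 1, n_ham=20000, n_spam=100,
                                imbalance_adjustment=True)
```

With equal totals the word scores very close to 0, but with 20000 ham against 100 spam the single spam occurrence is a larger fraction of its corpus than the 171 ham occurrences are of theirs, so the score ends up above 0.5 even though the raw counts look overwhelmingly hammy.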

Two suggestions:

 o Try disabling the experimental_ham_spam_imbalance_adjustment option.
You can do this by changing the default_bayes_customize.ini file in your
data folder (see the FAQ for the location).  You don't need to retrain
to see the effects of this, so have a look at that same message with the
option off and see what the score is.  Note that you might get more
false negatives, though.

 o Try retraining with roughly equal numbers of ham and spam (just take
a random selection of the bigger collection) and see how that works.
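For the second suggestion, picking the random selection could look something like the sketch below. The function name and the idea of holding the messages in two Python lists are assumptions for illustration; how you actually get at your ham and spam depends on your mail client.

```python
import random


def balanced_sample(ham, spam, seed=None):
    """Return equal-sized random subsets of two message collections.

    The smaller collection is kept whole (in shuffled order); the larger
    one is cut down to the same size by sampling without replacement.
    """
    rng = random.Random(seed)  # pass a seed to make the selection repeatable
    k = min(len(ham), len(spam))
    return rng.sample(ham, k), rng.sample(spam, k)


# Hypothetical stand-ins for real messages.
ham_msgs = ["ham-%d" % i for i in range(1000)]
spam_msgs = ["spam-%d" % i for i in range(50)]

ham_subset, spam_subset = balanced_sample(ham_msgs, spam_msgs, seed=1)
```

You would then retrain from scratch on ham_subset and spam_subset only, rather than adding them on top of the existing database.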

=Tony Meyer


