[Spambayes] RE: [spambayes-bugs] Spambayes repeatedly classifies messages from mailing list as SPAM despite multiple (20+) recoveries from spam folder

Tue Sep 9 22:47:36 EDT 2003

[Brian Schwarz]
>>> It's not a big issue, but I've noticed that my m-w.com "word
>>> of the day" is consistently flagged by Spambayes as spam [...]
>>> Here is an example of the mailing list messages that keep
>>> getting mislabeled as spam.

[Tony Meyer]
>> Would you also be able to send a copy of the tokens (with
>> scores/count) that this message produces?  You can do this via the
>> "Show Clues" command in the Outlook plug-in, or via the web
>> interface, or with the "debug" header in hammiefilter.  (It would
>> also be great to know which application you are using).
>>
>> It's the clues that provide the clues ;), not the message itself.

[Brian]
> Sorry, I should have thought of that.  I'm not really that surprised
> that the message gets flagged as spam initially - I'm just surprised
> that after a couple of weeks of "teaching" that this mailing list
> message hasn't been "whitelisted."
>
> Spam Score: 0.833733
>
>
> word                                spamprob         #ham  #spam
> '*H*'                               0.0566883           -      -
> '*S*'                               0.724154            -      -
> 'url:mydomain'                      0.00493094      17079      0

Yikes!  From this line I deduce you trained on about 378 *times* more ham
messages than spam messages:  a clue seen in 17,079 ham and no spam would
otherwise score much closer to 0 than 0.0049.  Spambayes works best if you
train on an approximately equal number of each.  Continuing to train on
even more ham than spam probably isn't going to help you at all.
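
A rough sketch of how that 378 can be backed out of the clue line.  It
assumes the word probability is smoothed as (S*X + n*p) / (S + n), that the
imbalance adjustment scales a word's ham count by nspam/nham, and that
S = 0.45 and X = 0.5.  None of those details is stated in this thread, and
the function name is made up for illustration.

```python
# Assumed smoothing defaults; neither value appears in this thread.
S = 0.45  # assumed strength given to the "unknown word" prior
X = 0.5   # assumed spamprob for a word never seen in training


def implied_ham_to_spam_ratio(spamprob, hamcount):
    """Estimate nham/nspam from a clue seen only in ham (spamcount == 0).

    With spamcount == 0 the unsmoothed probability p is 0, so under the
    assumed formula spamprob = S*X / (S + n), where
    n = hamcount * (nspam / nham) is the discounted ham count.
    """
    n = S * X / spamprob - S  # solve the formula above for n
    return hamcount / n       # hamcount / n == nham / nspam


ratio = implied_ham_to_spam_ratio(0.00493094, 17079)
print(round(ratio))  # 378
```

The only inputs are the spamprob and #ham columns of the 'url:mydomain'
line, so this works without knowing the actual training counts.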

So try balancing your training data (i.e., train on much less ham) and see
what happens.  If you don't want to try that, find your
default_bayes_customize.ini file, and change the line

experimental_ham_spam_imbalance_adjustment: True

to

experimental_ham_spam_imbalance_adjustment: False

That should help your specific problem a lot, but may increase the false
negative rate too (may give hammier scores to genuine spam messages).  If
you try it, let us know what happens.  Nobody developing this code had such
extreme training-set imbalance, and we really don't know what to do about
it.  Nothing we've tried so far works well for everyone in its presence
(apart from users finding a way to balance their training data themselves).
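
To illustrate why flipping that option should help here, the sketch below
scores a ham-only clue with the adjustment on and off.  The formula, the
0.45 and 0.5 defaults, the function name, and the 37800/100 training counts
are all assumptions chosen to reproduce a 378:1 imbalance; they are not
taken from Brian's database.

```python
def smoothed_spamprob(hamcount, spamcount, nham, nspam, adjust,
                      s=0.45, x=0.5):
    """Assumed form of the per-word spam probability.

    Expects the word to have been seen at least once in training.
    """
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    raw = spamratio / (hamratio + spamratio)
    if adjust:
        # Discount counts from whichever class is over-represented.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0
    n = hamcount * spam2ham + spamcount * ham2spam
    # Blend the raw estimate with the unknown-word prior x, weighted by n.
    return (s * x + n * raw) / (s + n)


# Hypothetical training totals with a 378:1 ham:spam imbalance.
NHAM, NSPAM = 37800, 100

on = smoothed_spamprob(17079, 0, NHAM, NSPAM, adjust=True)
off = smoothed_spamprob(17079, 0, NHAM, NSPAM, adjust=False)
print(on)   # roughly 0.0049, close to the clue shown above
print(off)  # roughly 0.000013, a far stronger ham clue
```

Under these assumptions, every ham clue gets stronger in the same way with
the adjustment off, which is also why genuine spam may end up with hammier
scores.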