[Spambayes] sbserver question

Tony Meyer tameyer at ihug.co.nz
Sat May 15 23:09:42 EDT 2004


> The status reports: Total emails trained: Spam: 1935 Ham: 323
> Then prints the message: "Warning: you have much more spam than ham -
> SpamBayes works best with approximately even numbers of ham and spam."
> 
> I've searched and can't find a way to 'delete' spam messages, 
> particularly older spam, from the training database.  Can anyone
> instruct me how this can be done.  If there's a warning that the
> spam/ham ratio is too high, there ought to be a way to correct it.

Unfortunately, there are only two ways to correct it at present (through
sb_server):

  1.  Retrain from scratch.
  2.  Train more of whichever class is low (ham, in your case).

Imbalance, training regimes, and expiring old messages are all things that
people are still looking into, and there isn't yet a clear answer as to what
to do, so things aren't as good as they could be at the moment.

The hope with the warning message was that people would see it when they had
(e.g.) 100 spam and 20 ham, and so could just hold off training spam until
they had a more balanced corpus.
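
To make the "hold off" idea concrete, here is a rough sketch of that kind
of check. It is not SpamBayes code: the function name and the 2:1 limit are
made up for illustration, and the real warning may use a different ratio.

```python
def class_to_pause(nspam, nham, max_ratio=2.0):
    """Say which class to stop training for now, or None if balanced.

    max_ratio is an assumed limit for illustration, not the value
    SpamBayes itself uses when it prints the warning.
    """
    if nspam > nham * max_ratio:
        return "spam"   # far more spam than ham: hold off training spam
    if nham > nspam * max_ratio:
        return "ham"    # far more ham than spam: hold off training ham
    return None         # roughly balanced: train either


# The counts from the status report quoted above.
print(class_to_pause(1935, 323))
```

With the counts from your status page, a check like this would say to
pause spam training until the ham count catches up.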

Note that using train-on-mistakes with appropriate thresholds, or non-edge
training, tends to result in a reasonably balanced corpus, as well as good
results.
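
As a rough illustration of those two regimes, the sketch below decides
whether a message that has already been scored should be trained on. It is
a standalone sketch, not the SpamBayes API: the function names, the cutoff
values, and the 0.05 edge margin are all assumptions chosen for the
example, and whether "unsure" counts as a mistake is left as a switch.

```python
# Assumed thresholds for illustration only; use whatever your own
# configuration has.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def classify(score):
    """Map a spam probability in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if score < HAM_CUTOFF:
        return "ham"
    if score >= SPAM_CUTOFF:
        return "spam"
    return "unsure"


def train_on_mistake(score, is_spam, include_unsure=True):
    """Train only if the classifier got this message wrong.

    With include_unsure=True an 'unsure' verdict also counts as a
    mistake; with False, only an outright wrong verdict does.
    """
    verdict = classify(score)
    truth = "spam" if is_spam else "ham"
    if verdict == "unsure":
        return include_unsure
    return verdict != truth


def train_non_edge(score, is_spam, edge=0.05):
    """Train unless the score already sits within `edge` of the
    correct extreme (near 1.0 for spam, near 0.0 for ham)."""
    if is_spam:
        return score < 1.0 - edge
    return score > edge


# A spam scored 0.93 is classified correctly, so train-on-mistakes
# skips it, but it is not yet at the edge, so non-edge trains on it.
print(train_on_mistake(0.93, True), train_non_edge(0.93, True))
```

Either way, most messages the classifier already handles confidently are
skipped, which is why the trained counts tend to stay closer together.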

=Tony Meyer

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.



