[Spambayes] Confusion about Unix or Linux documentation

Skip Montanaro skip at pobox.com
Thu Jul 22 01:09:33 CEST 2004


    Aaron> If the messages that the user has identified as spam are placed
    Aaron> in the same file as the messages that spambayes has already
    Aaron> identified as spam, then we would be training on messages that
    Aaron> have never been used in training but spambayes has already
    Aaron> identified as spam.

    Aaron> Is this a useful thing to do?

That depends on your desired training strategy.  With a train-on-everything
strategy you'd certainly want that behavior, but I agree: in this day and
age, with a purported 80-something percent of all mail being spam, this can
grow your training database rapidly and perhaps unnecessarily.  Still, to
catch false positives you have to save them somewhere.  My procmail setup
saves every message that scores as spam and isn't deleted outright into a
specific mailbox, which I scan periodically.  The only things in that
mailbox I'd want to train on are false positives, which are rare beasties
indeed.
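
As a rough illustration of that kind of routing (not the actual procmail
recipe, which isn't shown here), the decision could be sketched in Python as
below.  The function name, the folder names and both cutoffs are hypothetical,
and the sketch assumes, purely for illustration, that outright deletion is
triggered by a very high score; the message above does not say what triggers
it.

```python
def route(score, spam_cutoff=0.90, delete_cutoff=0.999):
    """Decide where a scored message goes.

    score is assumed to be a spam probability in the range 0.0 to 1.0.
    Both cutoffs are made-up illustrative values, not recommended settings.
    """
    if score >= delete_cutoff:
        # Assumption: only near-certain spam is dropped without review.
        return "delete"
    if score >= spam_cutoff:
        # Scored as spam but kept, so false positives can still be rescued.
        return "spam-review"
    return "inbox"


for s in (0.05, 0.95, 1.0):
    print(s, "->", route(s))
```

Only a message fished out of the "spam-review" folder by hand would be a
candidate for training under this scheme.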

You might check here:

    http://www.entrian.com/sbwiki/TrainingIdeas

for some of the (many) training strategy options.  I use train-to-exhaustion
in such a way that throwing a message into the ham or spam pile that would
already score correctly gets it tossed out on its bum on the next pass, so
having a few extra of something is no big deal.
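
A minimal sketch of that idea, for anyone who wants to see it spelled out.
This is not the SpamBayes code: ToyClassifier is a made-up stand-in scorer
(simple smoothed log odds), and the function name, cutoffs and sample
messages are all assumptions chosen for illustration.  The point is only the
loop: each pass scores every message in both piles and trains on just the
ones that don't already score correctly, stopping when a pass trains on
nothing.

```python
import math
from collections import Counter


class ToyClassifier:
    """Stand-in scorer for illustration only; NOT the SpamBayes classifier."""

    def __init__(self):
        self.spam = Counter()   # token -> number of spam messages containing it
        self.ham = Counter()    # token -> number of ham messages containing it
        self.nspam = 0
        self.nham = 0

    def learn(self, tokens, is_spam):
        if is_spam:
            self.spam.update(set(tokens))
            self.nspam += 1
        else:
            self.ham.update(set(tokens))
            self.nham += 1

    def score(self, tokens):
        """Return a spam score between 0.0 (ham) and 1.0 (spam)."""
        log_odds = 0.0
        for t in set(tokens):
            # Add-one smoothing so unseen tokens don't produce zeros.
            p_spam = (self.spam[t] + 1) / (self.nspam + 2)
            p_ham = (self.ham[t] + 1) / (self.nham + 2)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-log_odds))


def train_to_exhaustion(clf, ham, spam, ham_cutoff=0.2, spam_cutoff=0.9,
                        max_passes=10):
    """Train only on messages that still score wrongly; return how many
    training events happened in total."""
    pile = [(m, False) for m in ham] + [(m, True) for m in spam]
    total = 0
    for _ in range(max_passes):
        trained = 0
        for tokens, is_spam in pile:
            s = clf.score(tokens)
            correct = s >= spam_cutoff if is_spam else s <= ham_cutoff
            if not correct:
                clf.learn(tokens, is_spam)
                trained += 1
        if trained == 0:        # a clean pass: nothing left to learn
            break
        total += trained
    return total


# Hypothetical, already-tokenized sample messages.
ham = [["meeting", "tomorrow", "lunch"], ["project", "report", "lunch"]]
spam = [["cheap", "pills", "buy"], ["buy", "now", "cheap"]]

clf = ToyClassifier()
first_run = train_to_exhaustion(clf, ham, spam)
# Dropping an extra copy of a message that already scores correctly into
# the spam pile causes no further training.
second_run = train_to_exhaustion(clf, ham, spam + [spam[0]])
print(first_run, second_run)
```

In this toy setup the second run does no training at all, which is the
"no big deal" property: a redundant message is scored, found to be correct,
and skipped.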

Skip

