[Spambayes] Advice on when to stop training.
T.A.Meyer at massey.ac.nz
Wed Sep 24 21:41:28 EDT 2003
> I am beginning to
> get a bit tired of having to train it all the time
No-one has really managed to establish what the 'best' training regime
is, but if it's classifying correctly, then you can probably safely
stop, or only train on unsures/misclassified messages. Tim Peters, the
godfather <wink> of the project, has a pretty small set of training data
and gets excellent results. More is not always better!
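
To make "only train on unsures/misclassified messages" concrete, here is
a minimal sketch of that rule. It is not SpamBayes code: the function
names are hypothetical, and the cutoffs of 0.20 and 0.90 are assumed
values picked for illustration.

```python
# Hypothetical sketch of a "train only on mistakes and unsures" rule.
# Not SpamBayes code; the cutoff values below are assumptions.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def classify(score):
    """Map a spam probability in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if score < HAM_CUTOFF:
        return "ham"
    if score < SPAM_CUTOFF:
        return "unsure"
    return "spam"


def needs_training(score, true_label):
    """Return True when the message is worth training on.

    true_label is 'ham' or 'spam'. A message qualifies when the verdict
    was 'unsure' or when it disagrees with the true label.
    """
    verdict = classify(score)
    return verdict == "unsure" or verdict != true_label
```

With a rule like this, a message that was already classified correctly
is simply skipped, so the training set only grows when the classifier
gets something wrong or cannot decide.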
> (plus I am worried that the cache will fill the disk).
Note that there are options to assist with this. You can set the number
of days messages stay in the cache (defaults to 7), after which they are
permanently removed. You can also elect not to cache messages over a
certain size, or not to cache 'bulk' messages (the majority of mailing
list traffic will identify itself as 'bulk'). You can, of course, turn
caching off completely, but, IMO, it would be better to leave it on and
just reduce the expiry time (unless you receive so much mail that you
cannot store even a single day's worth).
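
As a rough illustration of how those three options interact, the sketch
below models the decisions in plain Python. None of these names are real
SpamBayes options or functions; they are made up for the example, and
only the 7-day default comes from the description above.

```python
from datetime import datetime, timedelta


def should_cache(size_bytes, is_bulk, max_size_bytes=None, cache_bulk=True):
    """Decide whether a newly arrived message goes into the cache.

    max_size_bytes=None means "no size limit" (hypothetical parameter).
    is_bulk is True when the message identifies itself as bulk mail.
    """
    if max_size_bytes is not None and size_bytes > max_size_bytes:
        return False
    if is_bulk and not cache_bulk:
        return False
    return True


def is_expired(stored_at, now, expiry_days=7):
    """Return True once a cached message is older than expiry_days."""
    return now - stored_at > timedelta(days=expiry_days)


# Example: a 3-day expiry, a 100 kB size limit, and no bulk mail cached.
now = datetime(2003, 9, 24)
keep_new = should_cache(20000, is_bulk=False, max_size_bytes=100000,
                        cache_bulk=False)
drop_old = is_expired(datetime(2003, 9, 20), now, expiry_days=3)
```

The point is that a shorter expiry bounds how long anything stays on
disk, while the size and bulk filters bound how much goes in at all.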
> I would prefer to just send wrongly classified mails to
> spambayes_spam/ham at localhost and not have to deal with
> training all the mails that are delivered to me.
Note that, unless you can be certain that you are sending the mails
unaltered (few mail clients will let you do this), you should use the
'lookup message in cache' option for smtpproxy, which means storing the
messages in the cache for a certain amount of time anyway.
> What settings should I use to accomplish this?
I would recommend that if you are happy with the classification in
general, you enable the "don't cache messages larger than" option
(picking a size of your choice), and the "don't cache bulk mail" option.
You could also reduce the expiry time to 2 or 3 days, depending on how
often you go through your mail. Unless you are able to send mail
unaltered (with procmail or something like that), then you should have
the "lookup messages in cache" option enabled. You also need to enter
your SMTP server details. That should be it - note that you can still
use the web interface to review mail - it's both, not either/or.