[Spambayes] Question about training via the web interface

Kenny Pitt kennypitt at hotmail.com
Thu Apr 15 11:15:37 EDT 2004


Skip Montanaro wrote:
> Here's a thought...  Instead of blasting through your entire training
> set all at once, break it into chunks, say 100 messages each.  t-t-e
> on the first set, then using the resulting database t-t-e on the
> second set, etc. My guess is that after training to exhaustion on set
> 1, more messages in set 2 will score properly on the first pass and
> not need to be used as training fodder.  The result might be a faster
> run time for the entire set and a smaller database.

I don't know if this has been suggested before, but this leads me to
another idea: using TTE (train to exhaustion, Skip's "t-t-e") for
"incremental" training.

For "initial" training, we start from an empty training database and
perform TTE over the entire training set.  For incremental training, it
might be too expensive to train over that entire set every time, even in
100-message chunks.  Besides, not everyone will want to keep the entire
history of received messages around for future training.

Instead, we could cache only the most recent 100 messages (or whatever
number is reasonable for performance) to use in the next training
session.  Each training session would start from the *existing* training
database and perform TTE over the cached messages plus the newly
received messages.  After training, the new messages would be added to
the cache and the oldest messages deleted to get back to the desired
cache size.  I believe Gary alludes to something like this in his TTE
instructions, but doesn't give any details about how to do it.
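
To make the idea concrete, here is a rough sketch of one such training
session.  It is only an illustration under stated assumptions:
ToyClassifier is a made-up word-count stand-in, not the SpamBayes
classifier, and the names train_to_exhaustion, incremental_session and
max_rounds are hypothetical.  The cache is a deque whose maxlen plays
the role of the "desired cache size".

```python
from collections import Counter, deque


class ToyClassifier:
    """Hypothetical stand-in for a real classifier (NOT the SpamBayes API)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, text, is_spam):
        # Count each token under the class the message belongs to.
        (self.spam if is_spam else self.ham).update(text.split())

    def score(self, text):
        # Spam score in [0, 1]; 0.5 means no evidence either way.
        tokens = text.split()
        s = sum(self.spam[t] for t in tokens)
        h = sum(self.ham[t] for t in tokens)
        return 0.5 if s + h == 0 else s / (s + h)


def train_to_exhaustion(clf, messages, max_rounds=10):
    """Train only on messages that score wrongly, until a pass finds none.

    max_rounds is an assumed safety cap so the loop always terminates.
    Returns the number of passes made.
    """
    for round_no in range(1, max_rounds + 1):
        errors = 0
        for text, is_spam in messages:
            score = clf.score(text)
            correct = score > 0.5 if is_spam else score < 0.5
            if not correct:
                clf.train(text, is_spam)
                errors += 1
        if errors == 0:
            return round_no
    return max_rounds


def incremental_session(clf, cache, new_messages):
    """One session: TTE over cached + new messages on the EXISTING database."""
    new_messages = list(new_messages)
    train_to_exhaustion(clf, list(cache) + new_messages)
    # A deque created with maxlen drops the oldest entries automatically.
    cache.extend(new_messages)


clf = ToyClassifier()
cache = deque(maxlen=3)  # would be 100 or so in practice
incremental_session(clf, cache, [("cheap pills now", True),
                                 ("lunch at noon", False)])
incremental_session(clf, cache, [("cheap watches", True),
                                 ("meeting notes", False)])
```

In the second session "cheap watches" already scores as spam from the
first session's training, so it is never trained on, which is the
effect Skip was guessing at.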

-- 
Kenny Pitt



