[Spambayes] train-to-exhaustion questions

Thu Apr 26 06:13:29 CEST 2007

1. A recent training run went like this:

  round:  1, msgs:  690, ham misses:  61, spam misses: 210, 176.3s
  round:  2, msgs:  690, ham misses:   8, spam misses:  53, 165.6s
  round:  3, msgs:  690, ham misses:   1, spam misses:   7, 159.6s
  round:  4, msgs:  690, ham misses:   1, spam misses:   2, 159.6s
  round:  5, msgs:  690, ham misses:   0, spam misses:   1, 157.8s
  round:  6, msgs:  690, ham misses:   1, spam misses:   1, 160.9s
  round:  7, msgs:  690, ham misses:   0, spam misses:   1, 211.0s
  round:  8, msgs:  690, ham misses:   0, spam misses:   1, 172.6s
  round:  9, msgs:  690, ham misses:   0, spam misses:   1, 197.1s
  round: 10, msgs:  690, ham misses:   1, spam misses:   1, 174.6s

  The results appear to get *worse* in rounds 6 and 10, where ham misses
  go from 0 back to 1.  Am I misinterpreting this?  Are these expected
  results?
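
  For reference, a small sketch that parses the log lines quoted above
  and reports every round whose total misses (ham plus spam) went up
  compared with the previous round.  The regular expression and the
  helper names (parse_rounds, regressions) are made up for this
  illustration; they are not part of tte.py or SpamBayes.

```python
import re

# The ten log lines from the training run quoted above.
LOG = """\
  round:  1, msgs:  690, ham misses:  61, spam misses: 210, 176.3s
  round:  2, msgs:  690, ham misses:   8, spam misses:  53, 165.6s
  round:  3, msgs:  690, ham misses:   1, spam misses:   7, 159.6s
  round:  4, msgs:  690, ham misses:   1, spam misses:   2, 159.6s
  round:  5, msgs:  690, ham misses:   0, spam misses:   1, 157.8s
  round:  6, msgs:  690, ham misses:   1, spam misses:   1, 160.9s
  round:  7, msgs:  690, ham misses:   0, spam misses:   1, 211.0s
  round:  8, msgs:  690, ham misses:   0, spam misses:   1, 172.6s
  round:  9, msgs:  690, ham misses:   0, spam misses:   1, 197.1s
  round: 10, msgs:  690, ham misses:   1, spam misses:   1, 174.6s
"""

# Hypothetical pattern written to match the line layout shown above.
PATTERN = re.compile(
    r"round:\s*(\d+), msgs:\s*(\d+), "
    r"ham misses:\s*(\d+), spam misses:\s*(\d+)"
)


def parse_rounds(text):
    """Return a list of (round, ham_misses, spam_misses) tuples."""
    rows = []
    for match in PATTERN.finditer(text):
        rnd, _msgs, ham, spam = map(int, match.groups())
        rows.append((rnd, ham, spam))
    return rows


def regressions(rows):
    """Return the round numbers whose total misses exceed the prior round's."""
    worse = []
    for prev, cur in zip(rows, rows[1:]):
        if cur[1] + cur[2] > prev[1] + prev[2]:
            worse.append(cur[0])
    return worse


print(regressions(parse_rounds(LOG)))  # prints [6, 10]
```

  In both flagged rounds the spam misses stay at 1; only the ham misses
  move, from 0 to 1.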

2. I have about 350 each of ham and spam that I can use to train on.
   I'm sure that some of these messages are mostly redundant and add
   little or nothing of value to the training data.  I don't want to
   waste time on them every time I do a training run.  Is there some
   way to use tte.py to reduce my training set to the messages that
   actually make a difference?
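
   One way to make "messages that actually make a difference" concrete
   is to keep only the messages that were misclassified, and therefore
   trained on, in at least one round.  The sketch below shows that idea
   with a toy token-counting classifier.  Everything in it
   (ToyClassifier, train_to_exhaustion, the sample messages) is
   hypothetical and written only for illustration; it is not the
   SpamBayes classifier and says nothing about what tte.py actually
   offers.

```python
from collections import Counter


def tokens(text):
    """Split a message into lowercase whitespace-separated tokens."""
    return text.lower().split()


class ToyClassifier:
    """Toy stand-in for a real classifier: compares per-class token counts."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}

    def learn(self, text, label):
        self.counts[label].update(tokens(text))

    def classify(self, text):
        toks = tokens(text)
        spam = sum(self.counts["spam"][t] for t in toks)
        ham = sum(self.counts["ham"][t] for t in toks)
        if spam > ham:
            return "spam"
        if ham > spam:
            return "ham"
        return "unsure"  # a tie counts as a miss for either label


def train_to_exhaustion(messages, max_rounds=10):
    """Train only on misclassified messages until a round has no misses.

    Returns the sorted indices of messages that were ever trained on.
    """
    clf = ToyClassifier()
    used = set()
    for _ in range(max_rounds):
        misses = 0
        for i, (text, label) in enumerate(messages):
            if clf.classify(text) != label:
                clf.learn(text, label)
                used.add(i)
                misses += 1
        if misses == 0:
            break
    return sorted(used)


# Made-up sample data: messages 2 and 3 largely repeat 0 and 1.
MESSAGES = [
    ("buy pills now", "spam"),
    ("meeting at noon", "ham"),
    ("buy pills cheap", "spam"),
    ("meeting at noon tomorrow", "ham"),
]

print(train_to_exhaustion(MESSAGES))  # prints [0, 1]
```

   In this toy run the two near-duplicate messages are never trained
   on, so a reduced training set would contain only messages 0 and 1.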

Thanks in advance!

-- 
Dave Abrahams
Boost Consulting
http://www.boost-consulting.com

Don't Miss BoostCon 2007! ==> http://www.boostcon.com