[Spambayes] train-to-exhaustion questions
skip at pobox.com
Thu Apr 26 20:15:11 CEST 2007
David> 1. A recent training run went like this:
David> round: 1, msgs: 690, ham misses: 61, spam misses: 210, 176.3s
David> round: 2, msgs: 690, ham misses: 8, spam misses: 53, 165.6s
David> round: 3, msgs: 690, ham misses: 1, spam misses: 7, 159.6s
David> round: 4, msgs: 690, ham misses: 1, spam misses: 2, 159.6s
David> round: 5, msgs: 690, ham misses: 0, spam misses: 1, 157.8s
David> round: 6, msgs: 690, ham misses: 1, spam misses: 1, 160.9s
David> round: 7, msgs: 690, ham misses: 0, spam misses: 1, 211.0s
David> round: 8, msgs: 690, ham misses: 0, spam misses: 1, 172.6s
David> round: 9, msgs: 690, ham misses: 0, spam misses: 1, 197.1s
David> round: 10, msgs: 690, ham misses: 1, spam misses: 1, 174.6s
David> It seems that the results got *worse* in rounds 6 and 10. Am I
David> misinterpreting this? Are these expected results?
I would look through your log files, probably near the end, to see if there
is some message that is either a mistake or extremely hard to classify
properly. One type of spam that gives the tte algorithm fits is a very
short spam sent to an otherwise well-behaved mailing list. Such a message
has a bunch of hammy header clues and only one (or at most a few) spammy
clues. The first several times through, that message will probably score
as unsure, so it gets retrained over and over. After a while the hammy
header clues will become less hammy, perhaps even spammy. That will drag
some previously fine ham messages into the unsure range, and a vicious
cycle ensues. When this happens I just delete that problematic spam from
the database and retrain.
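
If you want to spot a repeat offender without reading the whole log, a
rough tally of how often each message is reported can help. This is only
a sketch: it assumes tte.py's -v output mentions each missed message on
its own line, so you will almost certainly need to adjust the grep
pattern to whatever tte.log actually contains:

    # tally the lines most often reported as misses/unsures in the log
    # (the 'unsure|miss' pattern is a guess; adapt it to your log's wording)
    grep -iE 'unsure|miss' ~/tmp/tte.log | sort | uniq -c | sort -rn | head

Once you've identified the culprit, delete it from newspam.old (or
newham.old) and rerun the wrapper.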
David> 2. I have about 350 each of ham and spam that I can use to train
David> on. I'm sure that some of these messages are mostly redundant
David> and add little or nothing of value to the training data. I
David> don't want to waste time on them every time I do a training
David> run. Is there some way to use tte.py to reduce my training
David> set to the messages that actually make a difference?
Many of them will train correctly the first time through, and the tte
script should not write those out at the end, so the .cull files it writes
wind up holding just the messages that actually made a difference.
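
You can check how much the culling buys you by comparing message counts
before and after a run. This assumes the training files are in mbox
format, where each message begins with a "From " line:

    # per-file message counts for the full and culled training sets
    grep -c '^From ' ~/tmp/newham.old ~/tmp/newham.old.cull
    grep -c '^From ' ~/tmp/newspam.old ~/tmp/newspam.old.cull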
I run tte.py via a shell script wrapper called "tte" (attached) and
currently run it like so:
    cd ~/tmp
    mv newham.old.cull newham.old
    mv newspam.old.cull newspam.old
    touch newham
    touch newspam
    HC=0.02 SC=0.98 RATIO=1:1 tte
new{ham,spam}.old.cull are the files written by tte itself. newham and
newspam are the messages I've saved from my mailer since the last run.
Skip
-------------- next part --------------
#!/bin/bash

# various generated files
## DB=$HOME/tmp/tte.db
LOG=$HOME/tmp/tte.log

# these are tighter than my actual scoring thresholds
HC=${HC:-0.03}
SC=${SC:-0.75}

TTEPY=$HOME/src/spambayes/contrib/tte.py
PYTHON=python
EXPIMP=sb_dbexpimp.py
UNHDR=sb_unheader.py

# can override these from the environment but I never do...
NEWHAM=${NEWHAM:-$HOME/tmp/newham}
NEWSPAM=${NEWSPAM:-$HOME/tmp/newspam}

# output of preliminary cleaning
OLDHAM=${NEWHAM}.old
OLDSPAM=${NEWSPAM}.old

# ratio of spam to ham
RATIO=${RATIO:-1:1}

# there must be at least one new ham or spam message to consider retraining
if [ -f "$NEWHAM" -o -f "$NEWSPAM" ] ; then
    # clean up ham and spam collections, removing spambayes headers
    echo cleaning $NEWHAM
    touch $NEWHAM
    $UNHDR -p 'X-Hammie|X-Spam' $NEWHAM >> $OLDHAM
    chmod 600 $OLDHAM
    rm $NEWHAM

    echo cleaning $NEWSPAM
    touch $NEWSPAM
    $UNHDR -p 'X-Hammie|X-Spam' $NEWSPAM >> $OLDSPAM
    chmod 600 $OLDSPAM
    rm $NEWSPAM

    echo $TTEPY > $LOG
    if $PYTHON $TTEPY \
            --ratio=$RATIO \
            -g $OLDHAM \
            -s $OLDSPAM \
            -R \
            -o Categorization:ham_cutoff:$HC \
            -o Categorization:spam_cutoff:$SC \
            -c .cull \
            -v 2>> $LOG ; then
        chmod 600 $OLDHAM.cull $OLDSPAM.cull
        exit 0
    else
        echo "db generation failed - check tte.log"
        exit 1
    fi
else
    echo "nothing new to train on"
fi