[Spambayes] train-to-exhaustion questions
skip at pobox.com
Thu Apr 26 20:15:11 CEST 2007
David> 1. A recent training run went like this:
David> round: 1, msgs: 690, ham misses: 61, spam misses: 210, 176.3s
David> round: 2, msgs: 690, ham misses: 8, spam misses: 53, 165.6s
David> round: 3, msgs: 690, ham misses: 1, spam misses: 7, 159.6s
David> round: 4, msgs: 690, ham misses: 1, spam misses: 2, 159.6s
David> round: 5, msgs: 690, ham misses: 0, spam misses: 1, 157.8s
David> round: 6, msgs: 690, ham misses: 1, spam misses: 1, 160.9s
David> round: 7, msgs: 690, ham misses: 0, spam misses: 1, 211.0s
David> round: 8, msgs: 690, ham misses: 0, spam misses: 1, 172.6s
David> round: 9, msgs: 690, ham misses: 0, spam misses: 1, 197.1s
David> round: 10, msgs: 690, ham misses: 1, spam misses: 1, 174.6s
David> It seems that the results got *worse* in rounds 6 and 10. Am I
David> misinterpreting this? Are these expected results?
I would look through your log files, probably near the end, to see if there
is some message that is either a mistake or extremely hard to classify
properly. One type of spam that gives the tte algorithm fits is a very
short spam sent to an otherwise well-behaved mailing list. Such a message
has a bunch of hammy header clues and only one (or at most a few) spammy
clues. The first several times through, that message will probably score
as unsure, so it gets retrained over and over. After a while the hammy
header clues will become less hammy, perhaps even spammy. That will drag
some previously fine ham messages into the unsure range, and a vicious
cycle ensues. When this happens I just delete that problematic spam from
the database and retrain.
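
If you want to spot a repeat offender without reading the whole log, a
rough tally of how often each message is reported can help. This is only
a sketch: it assumes tte.py's -v output mentions each missed message on
its own line, so you will almost certainly need to adjust the grep
pattern to whatever tte.log actually contains:

    # tally the lines most often reported as misses/unsures in the log
    # (the 'unsure|miss' pattern is a guess; adapt it to your log's wording)
    grep -iE 'unsure|miss' ~/tmp/tte.log | sort | uniq -c | sort -rn | head

Once you've identified the culprit, delete it from newspam.old (or
newham.old) and rerun the wrapper.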
David> 2. I have about 350 each of ham and spam that I can use to train
David> on. I'm sure that some of these messages are mostly redundant
David> and add little or nothing of value to the training data. I
David> don't want to waste time on them every time I do a training
David> run. Is there some way to use tte.py to reduce my training
David> set to the messages that actually make a difference?
Many of them will train correctly the first time through, and the tte
script should not write those out at the end, so the .cull files it writes
wind up holding just the messages that actually made a difference.
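
You can check how much the culling buys you by comparing message counts
before and after a run. This assumes the training files are in mbox
format, where each message begins with a "From " line:

    # per-file message counts for the full and culled training sets
    grep -c '^From ' ~/tmp/newham.old ~/tmp/newham.old.cull
    grep -c '^From ' ~/tmp/newspam.old ~/tmp/newspam.old.cull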
I run tte.py via a shell script wrapper called "tte" (attached) and
currently run it like so:
    cd ~/tmp
    mv newham.old.cull newham.old
    mv newspam.old.cull newspam.old
    touch newham
    touch newspam
    HC=0.02 SC=0.98 RATIO=1:1 tte
new{ham,spam}.old.cull are the files written by tte itself. newham and
newspam are the messages I've saved from my mailer since the last run.
Skip
-------------- next part --------------
#!/bin/bash

# various generated files
## DB=$HOME/tmp/tte.db
LOG=$HOME/tmp/tte.log

# these are tighter than my actual scoring thresholds
HC=${HC:-0.03}
SC=${SC:-0.75}

TTEPY=$HOME/src/spambayes/contrib/tte.py
PYTHON=python
EXPIMP=sb_dbexpimp.py
UNHDR=sb_unheader.py

# can override these from the environment but I never do...
NEWHAM=${NEWHAM:-$HOME/tmp/newham}
NEWSPAM=${NEWSPAM:-$HOME/tmp/newspam}

# output of preliminary cleaning
OLDHAM=${NEWHAM}.old
OLDSPAM=${NEWSPAM}.old

# ratio of spam to ham
RATIO=${RATIO:-1:1}

# there must be at least one new ham or spam message to consider retraining
if [ -f "$NEWHAM" -o -f "$NEWSPAM" ] ; then
    # clean up ham and spam collections, removing spambayes headers
    echo cleaning $NEWHAM
    touch $NEWHAM
    $UNHDR -p 'X-Hammie|X-Spam' $NEWHAM >> $OLDHAM
    chmod 600 $OLDHAM
    rm $NEWHAM

    echo cleaning $NEWSPAM
    touch $NEWSPAM
    $UNHDR -p 'X-Hammie|X-Spam' $NEWSPAM >> $OLDSPAM
    chmod 600 $OLDSPAM
    rm $NEWSPAM

    echo $TTEPY > $LOG
    if $PYTHON $TTEPY \
            --ratio=$RATIO \
            -g $OLDHAM \
            -s $OLDSPAM \
            -R \
            -o Categorization:ham_cutoff:$HC \
            -o Categorization:spam_cutoff:$SC \
            -c .cull \
            -v 2>> $LOG ; then
        chmod 600 $OLDHAM.cull $OLDSPAM.cull
        exit 0
    else
        echo "db generation failed - check tte.log"
        exit 1
    fi
else
    echo "nothing new to train on"
fi