[Spambayes] training problem?

Seth Goodman nobody at spamcop.net
Tue Dec 2 19:01:25 EST 2003


[Kenny Pitt]
> > Message Tokens:
> >
> > 684 unique tokens
>
> SpamBayes will use at most 150 tokens to determine the spam probability,
> while the complete message has 684.  SpamBayes chooses the 150 strongest
> tokens (i.e. those with probabilities farthest from a neutral 0.5); the
> rest are not used and are only shown in the Message Tokens section.
> SpamBayes also ignores any tokens whose probability falls between 0.4
> and 0.6.
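
For anyone following along, the selection rule above amounts to roughly the
sketch below.  This is purely illustrative; the function name and the
token-probability dict are made up for the example, not taken from the
SpamBayes source.

    MAX_DISCRIMINATORS = 150       # SpamBayes default, per Tim's note below
    MIN_PROB, MAX_PROB = 0.4, 0.6  # the "too close to neutral" band

    def strongest_tokens(token_probs, limit=MAX_DISCRIMINATORS):
        """token_probs: dict mapping token -> spam probability in [0.0, 1.0].

        Drop tokens inside the 0.4-0.6 band, then return up to `limit` of
        the rest, preferring those farthest from the neutral 0.5.
        """
        candidates = [(tok, p) for tok, p in token_probs.items()
                      if p < MIN_PROB or p > MAX_PROB]
        candidates.sort(key=lambda item: abs(item[1] - 0.5), reverse=True)
        return candidates[:limit]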

[Tim Peters]
> That's right.  Note that this 150 is the default value of the Classifier's
> max_discriminators option.  Setting it much higher than that can cause
> numerical problems in the inverse chi-squared probability computation,
> specifically at the
>
>     # XXX If x2 is very large, exp(-m) will underflow to 0.
>
> comment in chi2Q().  Testing showed that the exact value of
> max_discriminators didn't matter much, provided it was at least
> 30 (or so).
> Then again, most emails don't have 150 tokens, let alone 150 strong ones.
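
For reference, the computation Tim mentions is the usual series form of the
chi-squared survival function.  The sketch below is my own illustration and
may not match the chi2Q() in the SpamBayes source line for line:

    from math import exp

    def chi2Q(x2, v):
        """Return prob(chi-squared with v degrees of freedom >= x2).

        v must be even; the series below is exact for even v.
        """
        assert v & 1 == 0
        # XXX If x2 is very large, exp(-m) underflows to 0, and every later
        # term is built from it, so the whole sum collapses.  That is why
        # max_discriminators shouldn't be pushed far above its default.
        m = x2 / 2.0
        total = term = exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            total += term
        # Accumulated roundoff can push the sum slightly past 1.0; clamp it.
        return min(total, 1.0)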


Thanks to both of you for clearing this up.


The present problem I am fighting is false negatives.  The two messages I
posted about in this thread were just examples.  Performance is obviously
highly dependent on initial training set size and subsequent training
strategy, but I have not done terribly well with false negatives (yet!).  I
now have two weeks' worth of data using the following tactics:

1) Initial training set 650 spam, 654 ham on 11-16-03.

2) Initial filter thresholds 90/15.

3) Train on any spam that scores below 50 and any ham that scores above 15
(a rough sketch of this rule appears right after this list).  Filter all
unread mail after each training event to simulate

4) On 11-22-03, changed filter thresholds to 90/5.  Train on any ham that
scores above 5.  Trained 154 additional ham to rebalance databases.  In
reality, very, very few of the false negatives scored between 5 and 15, so
the threshold change did not make a large difference.

5) On 11-29-03, trained 118 additional ham to rebalance databases.
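
Expressed as code, the filing and training rules from items 2-4 come down to
roughly the following.  This is only an illustration: classify() and
training_action() are made-up names, not part of any SpamBayes API, and the
score is assumed to be on a 0.0-1.0 scale (the percentages above divided by
100).

    SPAM_CUTOFF = 0.90   # item 2: at or above this, file as spam
    HAM_CUTOFF = 0.15    # item 2: at or below this, file as ham
                         # (item 4: lowered to 0.05 on 11-22-03)

    def classify(score):
        """File a message by its score using the cutoffs above."""
        if score >= SPAM_CUTOFF:
            return "spam"
        if score <= HAM_CUTOFF:
            return "ham"
        return "unsure"

    def training_action(is_really_spam, score):
        """Item 3's rule: decide whether a scored message gets trained."""
        if is_really_spam and score < 0.50:
            return "train as spam"       # a false negative or low unsure
        if not is_really_spam and score > HAM_CUTOFF:
            return "train as ham"        # ham scoring above the ham cutoff
        return "no corrective training"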

Here are my results:

  date     spam   fn    fn%  fp   fp%  comments
--------   ----   --  -----  --  ----  --------
11-17-03    137   18  13.1%   0  0.0%  first full day after training
11-18-03    157   14   8.9%   0  0.0%
11-19-03    135   11   8.2%   0  0.0%
11-20-03    157   13   8.3%   0  0.0%
11-21-03    147    9   6.1%   0  0.0%
11-22-03    166    8   4.8%   0  0.0%  trained 154 add'l ham, lowered ham threshold
11-23-03    164   11   6.7%   0  0.0%
11-24-03    146    3   2.1%   0  0.0%
11-25-03    154    5   3.3%   0  0.0%
11-26-03    133    3   2.3%   0  0.0%
11-27-03    134    0   0.0%   0  0.0%
11-28-03    135    8   5.9%   0  0.0%
11-29-03    152    7   4.6%   0  0.0%  trained 118 add'l ham
11-30-03    138    6   4.4%   0  0.0%
12-01-03    157    9   5.7%   0  0.0%
12-02-03    106    8   7.6%   0  0.0%  partial day, not yet complete

SpamBayes is currently trained on 926 ham and 929 spam.  The very good news
is that there have been no false positives, which seems to be this program's
forte.  It appears that the system reached an optimum around 11-27-03 and has
gotten worse since then.  Alternatively, you could interpret it as having
stabilized by 11-21-03, with a few unusually good days after that.  This
false negative rate is similar to the results I had before, though back then
I did not use a pre-defined training scheme as I do now.  My questions are:

1) Is this typical or should I expect better?

2) What training tactics would you suggest that might work better?

Under the assumption that the basic classifier has undergone lots of testing
and is well-optimized, my guess is that most future performance
improvements, aside from bug fixes and parsing changes, will result from
training strategy.  Hoping that this is not completely misguided, I put some
ideas on training tactics on the wiki at
http://entrian.com/sbwiki/TrainingIdeas.  Comments, corrections and feedback
would be most appreciated.  I have no idea how many of these ideas have
already been tried, or what the results were.  As I don't care to waste
other people's time with old or naive ideas, let me know if that wiki
discussion is out to lunch and I'll either fix it or rip it down.

--
Seth Goodman

  Humans:   personal replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above



