[Spambayes] training problem?

Tim Peters tim.one at comcast.net
Tue Dec 2 21:18:24 EST 2003


[Seth Goodman]
> The present problem I am fighting is false negatives.

What do you mean by false negative?  We use it here to mean spam scoring
below your ham cutoff.
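To make the terminology concrete, here is a minimal sketch of how the two cutoffs partition scores into ham / unsure / spam buckets (the cutoff values and the exact boundary handling are illustrative, not the real SpamBayes code):

```python
def classify(score, ham_cutoff=0.15, spam_cutoff=0.90):
    """Bucket a message score in [0.0, 1.0] using ham/spam cutoffs.

    A spam landing in the 'ham' bucket is a false negative in the strict
    sense used here; spam landing in 'unsure' is sometimes loosely called
    a false negative too, which is the ambiguity in question.
    """
    if score < ham_cutoff:
        return "ham"
    elif score >= spam_cutoff:
        return "spam"
    return "unsure"
```

Under this reading, a spam scoring 0.50 with cutoffs of 90/15 is Unsure, not a false negative.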

> The two messages I posted about in this thread were just examples.

One would have sufficed <wink>.

> Performance is obviously highly dependent on initial training set
> size and subsequent training strategy, but I have not done terribly
> well with false negatives (yet!).  I now have two weeks worth of data
> using the following tactics:
>
> 1) Initial training set 650 spam, 654 ham on 11-16-03.
>
> 2) Initial filter thresholds 90/15.

So by "false negative" here you mean spam scoring below 15?  If so, I have
no theory, as I see maybe one of those per month (with about 700 emails per
day, including 200-250 daily spam).

> 3) Train on any spam that scores below 50, any ham that scores above
> 15. Filter all unread mail after each training event to simulate

If your spam cutoff is 90, why do you only train on spam scoring below 50?
Something doesn't sound right here.
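For comparison, the usual "train on mistakes" rule keys off the cutoffs themselves rather than a separate threshold like 50. A hedged sketch (the function name is hypothetical, not the SpamBayes API):

```python
def should_train(is_spam, score, ham_cutoff=0.15, spam_cutoff=0.90):
    """Train-on-mistakes: retrain whenever the classifier failed to put
    the message on the correct side of its own cutoff."""
    if is_spam:
        return score < spam_cutoff   # spam not confidently called spam
    return score >= ham_cutoff       # ham not confidently called ham
```

With this rule, spam scoring anywhere below the 90 cutoff would be trained, not just spam below 50.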

> 4) On 11-22-03, changed filter thresholds to 90/5.  Train on any ham
> that scores above 5.  Trained 154 additional ham to rebalance
> databases.  In reality, very, very few of the false negatives scored
> between 5 and 15, so the threshold change did not make a large
> difference.

Sorry, still don't know what you mean by false negative.  If you meant the
conventional "scored below 15" (your former ham cutoff), yet very, very few
of them scored between 5 and 15, it must mean that almost all of your false
negatives are scoring below 5.  Is that what you mean?

> 5) On 11-29-03, trained 118 additional ham to rebalance databases.
>
> Here are my results:
>
>   date     spam   fn    fn%  fp   fp%  comments
> --------   ----   --  -----  --  ----  --------
> 11-17-03    137   18  13.1%   0  0.0%  first full day after training
> 11-18-03    157   14   8.9%   0  0.0%
> 11-19-03    135   11   8.2%   0  0.0%
> 11-20-03    157   13   8.3%   0  0.0%
> 11-21-03    147    9   6.1%   0  0.0%
> 11-22-03    166    8   4.8%   0  0.0%  trained 154 add'l ham, lowered ham threshold
> 11-23-03    164   11   6.7%   0  0.0%
> 11-24-03    146    3   2.1%   0  0.0%
> 11-25-03    154    5   3.3%   0  0.0%
> 11-26-03    133    3   2.3%   0  0.0%
> 11-27-03    134    0   0.0%   0  0.0%
> 11-28-03    135    8   5.9%   0  0.0%
> 11-29-03    152    7   4.6%   0  0.0%  trained 118 add'l ham
> 11-30-03    138    6   4.4%   0  0.0%
> 12-01-03    157    9   5.7%   0  0.0%
> 12-02-03    106    8   7.6%   0  0.0%  partial day, not yet complete
>
> SpamBayes currently has trained 926 ham and 929 spam.  The very good
> news is no false positives, and that seems to be the forte of this
> program.

I expect it varies by person, but yes, I most often hear that people who
have used many spam gimmicks are most surprised by this gimmick's
low-to-zero FP rate.
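For what it's worth, the fn% column in the table above is just fn/spam; a quick check of a few rows (counts taken from the table, dates from its first column):

```python
# (date, spam count, false negatives) from three rows of the table
rows = [("11-17-03", 137, 18), ("11-18-03", 157, 14), ("11-24-03", 146, 3)]
for date, spam, fn in rows:
    # percentage rounded to one decimal, as in the table
    print(f"{date}: {100.0 * fn / spam:.1f}%")
```

These reproduce the 13.1%, 8.9%, and 2.1% figures in the corresponding rows.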

> It appears that the system reached an optimum around 11-27-03 and has
> gotten worse after that.  Alternately, you could interpret this as
> stabilized by 11-21-03 with a few unusually good days following that.
> This false negative rate is similar to the results I had before, though
> I did not use a pre-defined training scheme as I do now.  My questions
> are:
>
> 1) Is this typical

The only believable answer to that would have to come from broad testing.
At best, we had a peak of about a dozen active testers here, but half the
most active were focused on high-volume email filtering, not personal
application.  IOW, the broad testing needed to answer that question has
never been done.

> or should I expect better?

Ditto.  My own FP and FN rates are trivial (I'm genuinely surprised to see
any spam in my Inbox, and shocked to see a ham in my Spam folder, using
cutoffs of 20 and 80).  My Unsure rate (scores between 20 and 80) is heading
toward 5% -- but I don't care (I review all my spam anyway, and I'm on
enough admin-type mailing lists that I get a ton of weird email -- I can't
myself decide whether fully half the stuff in my Unsure folder is "really
ham" or "really spam", and toss it untrained after mentally shrugging).

> 2) What training tactics would you suggest that might work better?

Until we know what you mean by false negative, none.  If you're calling spam
that ends up Unsure "false negative", then reducing your spam cutoff should
help.  If you really are getting lots of spam scoring below 5, then that's
something I've never heard of before (anyone?).

> Under the assumption that the basic classifier has undergone lots of
> testing and is well-optimized, my guess is that most future
> performance improvements, aside from bug fixes and parsing changes,
> will result from training strategy.  Hoping that this is not
> completely misguided, I put some ideas on training tactics on the
> wiki at http://entrian.com/sbwiki/TrainingIdeas.  Comments,
> corrections and feedback would be most appreciated.  I have no idea
> how many of these ideas have already been tried and the results
> known.  As I don't care to waste other people's time with old or
> naive ideas, let me know if that wiki discussion is out to lunch and
> I'll either fix it or rip it down.

Thanks!  That's an excellent use for the Wiki.  People who disagree can add
their disagreements directly to the Wiki page.  Wikis are best when the
authors cooperate to change the text as agreements appear, so it doesn't end
up as a static ever-growing argument <wink>.



