[spambayes-dev] Another incremental training idea...

Seth Goodman nobody at spamcop.net
Tue Jan 13 18:03:19 EST 2004


I know cross-posting is poor netiquette, but here is a copy of a post on
incremental training that I made in an unrelated thread about sb_server in
the SpamBayes forum.

----------------------------------------------------

[Anthony Baxter]
> I have to wonder if making non-edge the default option in the next
> release of the code (with advice to toss the training database) isn't
> a bad plan.

Probably not a bad plan at all.  FWIW, I have been running a related regime
manually with the plug-in for a while and it works very well.  Here's my
setup:

I originally trained with an incrementally selected reduced training set
based on a corpus of around 7800 spam and 2500 ham.  The incremental
selection was as follows: add the 5 worst-scoring messages of each type to
the training set, rescore the corpus, repeat until all non-trained spam in
the corpus scored at least 90% (the hams quickly approached 0.0%).  The
resulting training set was 640 spam/640 ham (about 12% of the total corpus).
With thresholds of 80/5, there were only three unsures in the entire corpus
of 10K+ messages (the unsures were all trained messages: one ham, two spam).
I then started from this training set and now "train on almost everything"
daily, with asymmetric training thresholds (90/0.1) to partly mimic the
original training scheme.  Using classification thresholds of 80/5, my
composite stats with this manual regime since 12/16/2003 have been:

total spam  2889
total ham    376
fn             8   0.28%
fp             0   0.00%
unsure       133   4.07%
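
For anyone who wants to try the selection loop described above, here is a
minimal Python sketch.  It is not SpamBayes code: select_training_set,
score_fn and the parameter names are all hypothetical, and
score_fn(trained, msg_id) stands in for "retrain on the current training set
and return the spam probability (0.0 to 1.0) of one message".

```python
def select_training_set(corpus, score_fn, batch=5, spam_floor=0.90,
                        max_rounds=1000):
    """Incrementally pick a reduced training set.

    corpus   -- list of (msg_id, is_spam) pairs
    score_fn -- hypothetical callable: score_fn(trained_ids, msg_id) -> float
    """
    trained = set()
    for _ in range(max_rounds):
        untrained = [(m, s) for m, s in corpus if m not in trained]
        # Rescore every message that is not yet in the training set.
        scores = {m: score_fn(trained, m) for m, _ in untrained}
        spam = [m for m, s in untrained if s]
        ham = [m for m, s in untrained if not s]
        # Stop once all non-trained spam scores at least spam_floor.
        if all(scores[m] >= spam_floor for m in spam):
            break
        # Worst spam = lowest scores; worst ham = highest scores.
        worst_spam = sorted(spam, key=lambda m: scores[m])[:batch]
        worst_ham = sorted(ham, key=lambda m: scores[m], reverse=True)[:batch]
        trained.update(worst_spam, worst_ham)
    return trained
```

Adding the same number of each type per round is what tends to keep the
resulting set balanced, as in the 640/640 split above.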

The ham number is so low because I use Outlook rules to siphon off all the
mailing list traffic before the classifier starts.  Virtually all unsures
were spam (I haven't tracked it, but the number of unsure ham was certainly
fewer than 5).  The only issue I've had with this regime previously is that
after a while, the performance goes down (unsures increase).  Presumably,
this is because reduced training sets are hapax-driven and are very
sensitive to exactly which messages are trained, but that's just a guess.
You then have to go back to the original spam corpus with the new messages
added and tweak the training set to get performance back up.  A larger
training set based on train on almost everything would probably have fewer
unsures.
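
To make the two threshold pairs concrete, here is a small Python sketch of
the daily decision as I read it.  The function names are hypothetical, scores
are expressed as fractions (80% = 0.80, 0.1% = 0.001), and the handling of
scores exactly on a threshold (>= versus <) is an assumption rather than
something taken from the SpamBayes source.

```python
def classify(score, spam_cutoff=0.80, ham_cutoff=0.05):
    """Classification thresholds of 80/5."""
    if score >= spam_cutoff:
        return "spam"
    if score < ham_cutoff:
        return "ham"
    return "unsure"


def should_train(score, is_spam, spam_train_below=0.90,
                 ham_train_above=0.001):
    """Asymmetric training thresholds of 90/0.1.

    Train on a spam unless it already scores at least 90%;
    train on a ham unless it already scores at most 0.1%.
    """
    if is_spam:
        return score < spam_train_below
    return score > ham_train_above
```

Note that a spam scoring 85% is classified correctly but still gets trained,
which is the "almost everything" part.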

These fp and fn results are encouraging, while the unsure rate is mediocre.
With nham so low, we can't have much confidence in the measured fp rate (I
don't know the distribution, but the ham scores are tightly grouped around
zero; does anyone know how to calculate the SD of the fp estimate for an
nham this small?).
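
One way to get at the question above: with zero observed fp, the usual
plug-in SD, sqrt(p*(1-p)/n), comes out as zero and says nothing, so an upper
confidence bound is more informative.  The sketch below assumes each ham is
an independent trial with the same fp probability (an assumption, not
something measured here); the function name is hypothetical.

```python
def zero_event_upper_bound(n, confidence=0.95):
    """One-sided upper bound on p when 0 events are seen in n trials.

    Solves (1 - p)**n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


nham = 376
upper = zero_event_upper_bound(nham)   # exact binomial bound
rule_of_three = 3.0 / nham             # common quick approximation
print("95%% upper bound on fp rate: %.2f%% (rule of three: %.2f%%)"
      % (100 * upper, 100 * rule_of_three))
```

For nham = 376 both versions come out at roughly 0.8%, so a measured 0.00%
is compatible with a true fp rate anywhere up to about 1 in 125.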

I am fooling around with variable expiration times based on how "wrong" a
particular message classification was to see if I can possibly keep a
reduced training set up-to-date automatically and if that is, in fact, a
reasonable thing to do at all.  Maybe the pure train on almost everything
regime with a long time expiration (like Alex's four months) will be the
ultimate?

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above



