[Spambayes] incremental testing with CL2/CL3?

Jim Bublitz jbublitz@nwinternet.com
Sun, 06 Oct 2002 14:10:28 -0700 (PDT)


On 06-Oct-02 Brad Clements wrote:
> Someone mentioned they did incremental testing and posted their
> results, but I couldn't figure out what the results meant.

That would be me. Apparently nobody could figure out what I wrote.
The short summary: for my data, running the messages sequentially
with "daily" retraining gave far better results than any other
testing method; Graham worked slightly better than Spambayes for me
(< 0.3% difference in fp/fn percentages - small); and the effect of
the initial training size (as low as 1 ham, 1 spam) disappeared
after the first "day".
 
> So, I want to try it too.
 
> I notice in the TestDriver, comments like:
 
>     # CAUTION:  this just doesn't work for incremental training when
>     # options.use_central_limit is in effect.
>     def train(self, ham, spam):
 
> 
> I'm not planning on using untrain(), so does this comment still
> apply?
> 
> my plan is:

I'd suggest:

0. Start with a size-configurable basic training sample.
 
> 1. Receive 100 (configurable) messages "per day", with a
> (configurable) percentage of those being spam.
> 
> 2. run the classifier on those messages and make 3 categories:
> ham, spam, unsure. I want to know how many fall into each category
> on each "day".
> 
> 3. some percentage (configurable) of each category will be fed
> back into training each "day".
> 
> 4. Plot fn and fp rate "per day" for .. 30 days (configurable) to
> show how rates vary.

I had no errors in 21-day tests (with a large enough initial
training sample - otherwise the only errors were on the first
"day"). I needed to test 7K to 8K msgs of *each* type to see any
errors in the best case. Short tests are nice for debugging code and
checking the effects of methodology changes, as in (5) and (6)
below.
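
For steps 1-4, the kind of loop I have in mind is roughly the sketch
below. It's untested; it assumes step 0's initial training has
already been done, that the classifier object exposes Hammie-style
train(msg, is_spam), update_probabilities() and score(msg) methods,
and the cutoffs/percentages are just placeholders:

import random

def run_incremental_test(classifier, ham_msgs, spam_msgs, days=30,
                         per_day=100, spam_pct=0.5, feedback_pct=1.0,
                         spam_cutoff=0.90, ham_cutoff=0.20):
    ham_iter, spam_iter = iter(ham_msgs), iter(spam_msgs)
    daily_counts = []
    for day in range(1, days + 1):
        # step 1: build the day's batch (assumes the corpora are big enough)
        n_spam = int(per_day * spam_pct)
        batch = [(next(spam_iter), True) for _ in range(n_spam)]
        batch += [(next(ham_iter), False) for _ in range(per_day - n_spam)]
        random.shuffle(batch)

        # step 2: classify and count ham/spam/unsure, plus fp/fn
        counts = dict(ham=0, spam=0, unsure=0, fp=0, fn=0)
        feedback = []
        for msg, is_spam in batch:
            score = classifier.score(msg)
            if score >= spam_cutoff:
                counts['spam'] += 1
                counts['fp'] += not is_spam
            elif score <= ham_cutoff:
                counts['ham'] += 1
                counts['fn'] += is_spam
            else:
                counts['unsure'] += 1
            # step 3: feed a configurable fraction back into training
            if random.random() < feedback_pct:
                feedback.append((msg, is_spam))

        for msg, is_spam in feedback:
            classifier.train(msg, is_spam)
        classifier.update_probabilities()

        # step 4: collect per-day numbers for plotting
        daily_counts.append((day, counts))
    return daily_counts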
 
> 5. modulate max_discriminators, training feedback (% of messages
> in each category 
> fed back into system) vs. "days" to get a feel for the results a
> typical user might expect..

The other thing that would be interesting (to me anyway) is whether
it's possible/desirable for the system to modify the discrimination
cutoff(s) automatically based on new training data. In other words,
if the system starts at "score > 0.5 is spam", can learning adjust
that number to compensate for changes in newly learned data?
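
One strawman, just to make the idea concrete (the percentile choices
are arbitrary, and nothing like this exists in the code): keep the
scores the classifier assigned to recently trained ham and spam, and
only move the cutoff when the two distributions stay separated:

def adjusted_cutoff(recent_ham_scores, recent_spam_scores, current=0.5):
    if not recent_ham_scores or not recent_spam_scores:
        return current
    ham = sorted(recent_ham_scores)
    spam = sorted(recent_spam_scores)
    ham_high = ham[int(0.95 * (len(ham) - 1))]    # ~95th percentile of ham
    spam_low = spam[int(0.05 * (len(spam) - 1))]  # ~5th percentile of spam
    if ham_high >= spam_low:
        return current          # distributions overlap: leave it alone
    return (ham_high + spam_low) / 2.0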

> 6. re-run testing using new classifier schemes.. 
> 
> where do I start?

I'm not sure what other info you need - you seem to have it all in
order. For my data, msg filename == delivery timestamp (one msg
per file), but otherwise you'd probably get the most accurate
ordering from the first "Received" line encountered in the headers,
if the msgs aren't already ordered. As for the code, I instantiated
Hammie (from hammie.py) with hammie.createBayes and just made calls
to Hammie.train, Hammie.update_probabilities, and Hammie.score. I
didn't try "untrain" either - it would be interesting to see whether
using it is good or bad.
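
Roughly, what I did looks like the sketch below. The hammie.py names
are the ones above, but the exact signatures may have changed since,
and the "Received" parsing is only illustrative:

import email.Utils   # spelled email.utils on newer Pythons
import hammie

def received_time(msg_text):
    # Sort key: the timestamp after the ';' in the first "Received:" line.
    for line in msg_text.splitlines():
        if line.lower().startswith('received:') and ';' in line:
            parsed = email.Utils.parsedate_tz(line.split(';', 1)[1].strip())
            if parsed:
                return email.Utils.mktime_tz(parsed)
    return 0    # no usable Received header; sorts first

def ordered_and_trained(msgs, initial_ham, initial_spam):
    """msgs: raw message strings.  Returns (time-ordered msgs, Hammie)."""
    msgs = sorted(msgs, key=received_time)
    h = hammie.Hammie(hammie.createBayes())
    for m in initial_spam:
        h.train(m, True)
    for m in initial_ham:
        h.train(m, False)
    h.update_probabilities()
    return msgs, h

# later, per message:  score = h.score(msg); ...; h.train(msg, is_spam)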

I also accumulated "weekly" totals, thinking I might need to smooth
out "daily" variations, but the error rates are so low that the only
thing it told me was whether the errors occurred early or late in
the sequence. The last test run I did had only two errors - one on
the first "day" and one on almost the last "day" (somewhere between
7700 and 8000 ham msgs).


Jim