[Spambayes] incremental testing with CL2/CL3?

Tim Peters tim.one@comcast.net
Mon, 07 Oct 2002 01:30:41 -0400


[Brad Clements]
> Someone mentioned they did incremental testing and posted their
> results, but I couldn't
> figure out what the results meant.
>
> So, I want to try it too.
>
> I notice in the TestDriver, comments like:
>
>     # CAUTION:  this just doesn't work for incremental training when
>     # options.use_central_limit is in effect.
>     def train(self, ham, spam):
>
>
> I'm not planning on using untrain(), so does this comment still apply?

I replied to this before with "sorry, yes", but this issue needs to be
forced, and I checked in changes so we can at least *try* this.

Let me explain the problem:

Under the all-default scheme, the only thing we remember about training msgs
is how many msgs each word appears in.  That's all.  Given any msg, we can
add it or remove it at will, and the only effect it has is on the
word->hamcount and word->spamcount maps (from which we guess probabilities).
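
To make that concrete, here's a minimal sketch -- not the actual
classifier code, the names are illustrative -- of why adding and
removing msgs is symmetric under the default scheme:

from collections import defaultdict

wordinfo = defaultdict(lambda: [0, 0])    # word -> [hamcount, spamcount]

def train_msg(tokens, is_spam):
    for word in set(tokens):              # count each word once per msg
        wordinfo[word][is_spam] += 1      # is_spam is a bool (0 or 1)

def untrain_msg(tokens, is_spam):
    # Exactly undoes train_msg, given the same tokens:  nothing else
    # about the msg was remembered.
    for word in set(tokens):
        wordinfo[word][is_spam] -= 1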

The central limit schemes are quite different this way:  we not only save
word->hamcount and word->spamcount maps (and in exactly the same way, so no
problem there), we also do a third training pass
(.central_limit_compute_population_stats{,2,3}) under the covers.  This
looks for the set of "extreme words" in each training message (which can't
be known until after update_probabilities() completes), and saves away
statistics about their probabilities, one set of statistics for all the ham
messages trained on, and a parallel, distinct set for all the spam messages
trained on.
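
Roughly -- this is just a sketch, not the real code; spamprob() and the
use of max_discriminators here are illustrative -- that third pass does
something like:

def third_pass(train_msgs, spamprob, max_discriminators):
    # For each training msg, find its most extreme word probabilities
    # and fold them into running statistics, one set for ham and one
    # for spam.
    stats = {False: [], True: []}         # is_spam -> extreme word probs
    for tokens, is_spam in train_msgs:
        probs = sorted((spamprob(word) for word in set(tokens)),
                       key=lambda p: abs(p - 0.5),
                       reverse=True)
        stats[is_spam].extend(probs[:max_discriminators])
    return stats   # the saved statistics are summaries (means etc) of these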

The problem with incremental training under the clt schemes is in that third
pass:  when you train on any new data:

1. The word->hamcount and word->spamcount maps change.

2. This in turn changes word probabilities.  The word probabilities
   that *were* used in the third training pass for *previous* data
   are no longer current, and so the statistics computed from them are
   also incorrect for the new state of the world.

3. Changing word probabilities can in turn even change the *set*
   of extreme words in a msg.  And again, the set of extreme words
   found by the third training pass for previous data may not even
   be the correct extreme words for the new state of the world.

There's simply no way to repair #2 and #3 short of recomputing them from
scratch for every msg ever trained on, and that requires feeding them all
into the system again (or a moral equivalent, like storing, for each msg
ever trained on, the set of tokens it generated).
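
The moral equivalent amounts to something like this (hypothetical,
abbreviated names -- the point is only that every trained msg's tokens
have to be kept around):

trained = []    # (tokens, is_spam) for every msg ever trained on

def retrain_from_scratch(make_classifier):
    c = make_classifier()
    for tokens, is_spam in trained:
        c.learn(tokens, is_spam)            # rebuild the count maps
    c.update_probabilities()                # word probs now match the counts
    c.compute_population_stats(trained)     # redo the third pass from scratch
    return c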

In particular, as time goes on the probabilities computed in #2 get more
extreme (closer to 0.0 and closer to 1.0) for strong clues, and clt2 and
clt3 in particular make extreme use of extreme words.  clt1 is less
sensitive that way.  This implies that, if you don't retrain on every msg,
the mild spamprobs in the msgs first trained on will forever after drag down
the statistics toward neutrality.


There are two hacks I can think of to try, short of retraining on every msg
ever seen:

1. Just keep adding in new statistics, and don't worry about the
   moderating effects of the early msgs.  The code as checked in now
   will do this:  so long as you don't call new_classifier(), each
   time train() is called it just adds the new statistics to the
   old ones (before I checked in the changes, it overwrote the
   old statistics, as if they had never existed).

2. Simply overwrite the old statistics, as the code did before this
   checkin.  This is as if the third training pass had never been
   done for older messages.

My intuition (which isn't worth much!) is that #2 is quirkier and riskier,
making much of the effect of the central-limit gimmicks depend solely on the
last batch of msgs trained on.  #1 should have much greater stability over
time, but that's not necessarily a good thing if the stability is bought at
the cost of not moving quickly enough toward the true state of the world.

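In code terms the difference between the two hacks is trivial
(illustrative names only):

def merge_stats(old, new, accumulate=True):
    # `old` holds statistics saved by earlier calls to train(); `new`
    # holds the statistics from the batch just trained on.
    if accumulate:
        return old + new    # hack #1:  what the checked-in code does now
    else:
        return new          # hack #2:  forget the older msgs' third pass
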
Anyway, the only way to know is to try it.

> my plan is:
>
> 1. Receive 100 (configurable) messages "per day", with a
>   (configurable) percentage of those being spam.

You're ordering these by time received, right?

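FWIW, a driver for step #1 could deal the msgs out something like this
(sketch only; per_day and spam_fraction are your two knobs):

def daily_batches(ham_by_time, spam_by_time, per_day=100, spam_fraction=0.2):
    # Deal msgs out in time-received order, per_day at a time, with
    # roughly spam_fraction of each day's batch being spam.
    n_spam = int(round(per_day * spam_fraction))
    n_ham = per_day - n_spam
    while ham_by_time or spam_by_time:
        yield ham_by_time[:n_ham], spam_by_time[:n_spam]
        ham_by_time = ham_by_time[n_ham:]
        spam_by_time = spam_by_time[n_spam:]
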
> 2. run the classifier on those messages and make 3 categories:
>    ham, spam, unsure.  I want to know how many fall into each
>    category on each "day".

I would like to see eight categories instead:

     ham sure correct
     ham sure incorrect
     ham unsure correct
     ham unsure incorrect

and the same four for spam.
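
Something like this would do the bucketing, assuming a hypothetical
prediction consisting of a ham/spam guess plus a sure/unsure flag (not
the real TestDriver interface):

def bucket(actual, guess, is_sure):
    # actual and guess are each "ham" or "spam"; is_sure says whether
    # the classifier was certain or in its middle ground.
    sureness = "sure" if is_sure else "unsure"
    outcome = "correct" if guess == actual else "incorrect"
    return "%s %s %s" % (actual, sureness, outcome)   # "ham sure correct" etc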

> 3. some percentage (configurable) of each category will be fed
>    back into training each "day".

There's a world of interesting variations here <wink>.  For example, what if
you only feed it "sure but wrong" false positives and false negatives?  Or
only those plus "unsure but wrong" mistakes?  Or only the latter?  Etc.
Semi-realistic is to feed it all mistakes, and a random sampling from
correct results.  It's hard to know what people would really do, but I'm
*most* interested at first in what happens if intelligent use of the system
is made.
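
For the semi-realistic policy, something along these lines would do
(the `results` list and its fields are made up for illustration):

import random

def pick_feedback(results, sample_rate=0.1):
    # Feed back every mistake, plus a random sample of the correct
    # results.  `results` holds (msg, actual, bucket) tuples, with
    # bucket strings like "ham sure correct" as above.
    feedback = []
    for msg, actual, bucket in results:
        if bucket.endswith("incorrect") or random.random() < sample_rate:
            feedback.append((msg, actual))
    return feedback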

> 4. Plot fn and fp rate "per day" for .. 30 days (configurable) to
>    show how rates vary..

Note that there are two f-n and two f-p rates under the clt schemes (the
"sure" and "unsure" mistake rates).

> 5. modulate max_discriminators, training feedback (% of messages
>   in each category fed back into system) vs. "days" to get a feel
>   for the results a typical user might expect..

Like such a beast exists <wink>.  I know one of my sisters well enough to
guess that she would feed it every false negative, and nothing else.

> 6. re-run testing using new classifier schemes..
>
> where do I start?

At step #1 <wink>.  You'll need a custom test driver, but those are easy
enough to write.  Really stare at the differences between, e.g., timtest.py
and timcv.py:  the differences between strategies as different as a grid
driver and a cross-validation driver amount to a few dozen lines of code in
one function.

For this, something like:

import TestDriver

d = TestDriver.Driver()
ham, spam = ...    # some initial set of msgs to get things started
d.train(ham, spam)

for day in range(number_of_days):
    ham, spam = ...    # get the day's new msgs
    d.test(ham, spam)
    d.finishtest()
    # Print out whatever stats you want, although d.finishtest()
    # automatically prints out all the stuff you're interested in, so
    # this may be much more a matter of writing a custom output
    # analyzer; inferring the 4 error rates from pairs of 4-line
    # histograms would be a PITA that we could make easier (adding new
    # "-> <stat>" lines is easy, and harmless so long as they're not
    # easily confusable with the lines of this kind other programs are
    # already extracting).
    ham2, spam2 = ...    # the msgs from ham & spam you want to train on
    d.train(ham2, spam2)
d.alldone()