[spambayes-dev] Incremental testing

Tony Meyer ta-meyer at ihug.co.nz
Wed Jan 21 03:47:40 EST 2004

I've finally got around to writing up my latest incremental testing results.
I think I've now managed to get my head around the incremental setup, so
my earlier results are probably better ignored :)

The results are summarised here, but for all the pretty graphs, see:
(The page has over 60 graphs, so it may take a little bit to load...)

These testing runs had two aims: to test the various regimes, including a
few that aren't in the CVS copy of regimes.py (the balanced ones), and to
compare each regime with the experimental bigrams option enabled.

The winner
Expiring data after 30 days did better than keeping it. I suspect this is
because with each of the major changes in spam volume, the new spam was of a
different type, and the expiring regime managed these changes better. It was
still beaten by other regimes, but not by much, and it would be interesting
to try, for example, a 'fpfnunsure' regime that expired as well.
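(For anyone curious, the expiry idea is simple enough to sketch. This is
just an illustration, not the actual regimes.py code - the classifier here
is a dummy stand-in, and the learn/unlearn interface is my own:)

```python
from collections import deque

EXPIRY_DAYS = 30

class CountingClassifier:
    """Dummy stand-in for the real classifier: just counts tokens."""
    def __init__(self):
        self.spam = {}
        self.ham = {}

    def learn(self, tokens, is_spam):
        counts = self.spam if is_spam else self.ham
        for t in tokens:
            counts[t] = counts.get(t, 0) + 1

    def unlearn(self, tokens, is_spam):
        counts = self.spam if is_spam else self.ham
        for t in tokens:
            counts[t] -= 1
            if counts[t] == 0:
                del counts[t]

class ExpiringRegime:
    """Train normally, but untrain each message once it is 30 days old."""
    def __init__(self, classifier):
        self.classifier = classifier
        self.trained = deque()  # (day_trained, tokens, is_spam)

    def train(self, day, tokens, is_spam):
        self.classifier.learn(tokens, is_spam)
        self.trained.append((day, tokens, is_spam))
        # Expire anything trained more than EXPIRY_DAYS ago.
        while self.trained and day - self.trained[0][0] > EXPIRY_DAYS:
            _, old_tokens, old_is_spam = self.trained.popleft()
            self.classifier.unlearn(old_tokens, old_is_spam)
```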

The 'nonedge' regime wins for the most part, except when there was a large
spike in the amount of spam around day 320, at which point it loses.

The 'fpfnunsure' regime seems to be the overall winner, since it almost
matches the 'nonedge' regime most of the time, and does much better after
day 320.
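(In case it's not clear what 'fpfnunsure' does: it trains only on false
positives, false negatives, and unsures. A minimal sketch of that decision -
the cutoffs below are illustrative, not necessarily the values I ran with:)

```python
HAM_CUTOFF = 0.2   # below this, classified as ham (illustrative value)
SPAM_CUTOFF = 0.9  # at or above this, classified as spam (illustrative)

def should_train(score, is_spam):
    """Train only on false positives, false negatives, and unsures."""
    if HAM_CUTOFF <= score < SPAM_CUTOFF:
        return True                       # unsure: always train
    classified_spam = score >= SPAM_CUTOFF
    return classified_spam != is_spam     # train only on mistakes
```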

None of the self-balancing regime adaptations that I've tried has improved
the results (apart from, oddly, nonedge with bigrams). I'm sure that a
balancing regime could be designed that would help, but it seems that this
code isn't it.
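(For reference, the idea behind the balancing adaptations is to skip
training that would make the ham/spam balance more lopsided. Roughly - and
this is just a sketch of the idea with an illustrative ratio limit, not the
code I actually ran:)

```python
MAX_RATIO = 2.0  # illustrative: allow at most a 2:1 imbalance

def balanced_should_train(nham, nspam, is_spam, base_decision):
    """Apply a base regime's decision, but skip training that would
    push the ham/spam ratio past the limit."""
    if not base_decision:
        return False
    if is_spam and nham > 0 and (nspam + 1) / nham > MAX_RATIO:
        return False  # already spam-heavy: skip this spam
    if not is_spam and nspam > 0 and (nham + 1) / nspam > MAX_RATIO:
        return False  # already ham-heavy: skip this ham
    return True
```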

Amount of training
Although it's not displayed in the graphs, the number of messages that were
trained on varied a lot between the regimes. The perfect regime trained on
about 12000 messages, or 32.9/day, which is much, much more than any of the
partial-training regimes. Interestingly, the balanced options trained on
about the same number of total messages as the non-balanced options
(indicating that little was gained from the balancing, I think). The nonedge
regime trained on about 850 messages, or 2.3/day, just under the nonedge
regime with bigrams, which trained on more - about 1050, or 2.9/day.
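(Those per-day figures check out if the test period is roughly a year - 365
days is my assumption here, not something stated above:)

```python
DAYS = 365  # assumed length of the test period

for regime, total in [("perfect", 12000),
                      ("nonedge", 850),
                      ("nonedge+bigrams", 1050)]:
    print(f"{regime}: {total / DAYS:.1f}/day")
# perfect: 32.9/day
# nonedge: 2.3/day
# nonedge+bigrams: 2.9/day
```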

Bigrams hurt the results of balanced_perfect, imbalanced_perfect,
fpfnunsure, and nonedge. Bigrams helped the results of balanced_nonedge,
perfect, balanced_corrected, expire1month, and corrected.

When bigrams won, they tended to reduce false positives a lot, reduce false
negatives a little, and increase unsures. With bigrams the results for each
set did not differ nearly as much, and, as a result, the regimes were much
more clearly delimited.
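(For anyone who hasn't tried the option: bigrams add a token for each
adjacent pair of words, alongside the usual unigrams. Roughly like this -
the 'bi:' prefix is my own illustration, not necessarily what the tokenizer
actually generates:)

```python
def tokens_with_bigrams(words):
    """Yield each unigram, plus a token for each adjacent pair."""
    for i, word in enumerate(words):
        yield word
        if i + 1 < len(words):
            yield "bi:%s %s" % (word, words[i + 1])

list(tokens_with_bigrams(["cheap", "pills", "now"]))
# ['cheap', 'bi:cheap pills', 'pills', 'bi:pills now', 'now']
```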

What does this say about using bigrams? Pretty much that more research
(particularly with other corpora) is needed, IMO. <wink>

BTW, to compare, I ran a timcv.py -n5 test with the same data. Bigrams
easily won.

=Tony Meyer
