[Spambayes] train on error - to exhaustion?

Greg Louis glouis at dynamicro.on.ca
Tue Dec 3 16:27:34 2002


On 20021203 (Tue) at 0704:36 -0500, Greg Louis wrote:
> 
> Summarizing,
>   round meanerrpc lcl95 ucl95
> 1     1      6.37  6.13  6.60
> 2     2      5.80  5.56  6.04
> 3     3      5.74  5.50  5.98
> 4     4      5.74  5.50  5.98
> 5     5      5.68  5.44  5.92
> 
> It appears that a second round of training did improve discrimination
> slightly, but after that the law of diminishing returns set in.
> 
> What remains to be done is to start again from scratch and do a full
> training, followed by one round of training-on-error, and run the test
> data against those two training sets to see if the result is any
> different.

       train meanerrpc lcl95 ucl95
1 production      2.11  1.79  2.44
2   errtwice      5.80  5.48  6.12
3       full      5.10  4.78  5.43
4    fullerr      5.10  4.78  5.43

(meanerrpc is the mean error percentage over the runs; lcl95 and
ucl95 are the lower and upper limits of its 95% confidence interval.)

Production refers to my big production training set, included just for
comparison; it was full-trained up to about 10k spams and 10k hams and
has since been trained, not on random selections, but on every error
encountered.

Errtwice is two rounds of training-on-error with the 6372-of-each
training corpus.  Full is one round of full training with the same
corpus, and fullerr is one round of full training followed by one round
of train-on-error (only 18 spams and 221 nonspams were registered in
that round; although the full and fullerr means are identical, there
was some variation in the individual runs).
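
For concreteness, here's a minimal sketch of what one train-on-error
round does.  The classifier interface (the score() and train() methods
and the 0.5 cutoff) is a hypothetical stand-in, not bogofilter's
actual API:

    def train_on_error_round(classifier, corpus, cutoff=0.5):
        # Score every message; register only the misclassified ones.
        errors = 0
        for msg, is_spam in corpus:
            if (classifier.score(msg) >= cutoff) != is_spam:
                classifier.train(msg, is_spam)
                errors += 1
        return errors

    # "To exhaustion" would mean repeating such rounds until a round
    # registers no errors, or (per the numbers above) until the error
    # rate stops improving.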

Doesn't look as though pure training-on-error is particularly
advantageous with the Robinson-Fisher (chi) calculation method.  It may
still be useful in maintaining the effectiveness of an established
training base.
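
For reference, the Robinson-Fisher (chi) combining mentioned above
works roughly as follows; this is a bare sketch, and real
implementations also guard against underflow when the log-sums get
large:

    import math

    def chi2Q(x2, v):
        # Probability that a chi-squared variate with v (even)
        # degrees of freedom exceeds x2, via the standard series.
        m = x2 / 2.0
        total = term = math.exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def robinson_fisher(probs):
        # probs: per-token spam probabilities in (0, 1).  Under the
        # null hypothesis that they are uniform, -2*sum(ln p) is
        # chi-squared with 2n degrees of freedom, so chi2Q measures
        # how surprisingly spammy (S) or hammy (H) the tokens are.
        n = len(probs)
        S = chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
        H = chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
        # Near 1 means spam, near 0 means ham, near 0.5 unsure.
        return (S - H + 1.0) / 2.0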

The above experiment is described more fully at
http://www.bgl.nu/~glouis/bogofilter/training.html

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg@bgl.nu |


