[Spambayes] train on error - to exhaustion?
Greg Louis
glouis at dynamicro.on.ca
Tue Dec 3 16:27:34 2002
On 20021203 (Tue) at 0704:36 -0500, Greg Louis wrote:
>
> Summarizing,
>   round meanerrpc lcl95 ucl95
> 1     1      6.37  6.13  6.60
> 2     2      5.80  5.56  6.04
> 3     3      5.74  5.50  5.98
> 4     4      5.74  5.50  5.98
> 5     5      5.68  5.44  5.92
>
> It appears that a second round of training did improve discrimination
> slightly, but after that the law of diminishing returns set in.
>
> What remains to be done is to start again from scratch and do a full
> training, followed by one round of training-on-error, and run the test
> data against those two training sets to see if the result is any
> different.
       train meanerrpc lcl95 ucl95
1 production      2.11  1.79  2.44
2   errtwice      5.80  5.48  6.12
3       full      5.10  4.78  5.43
4    fullerr      5.10  4.78  5.43
Production refers to my big production training set, included just for
comparison; it was full-trained up to about 10k spams and 10k hams, and
has since been trained, not on randomly chosen messages, but on every
error encountered.
Errtwice is two rounds of training-on-error with the 6372-of-each
training corpus. Full is one round of full training with the same
corpus, and fullerr is one round of full training followed by one round
of train-on-error (only 18 spams and 221 nonspams were registered in
that round; although the means are identical, there was some variation
in the individual runs).
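In case the terminology is unclear: full training registers every
message in the corpus, while training-on-error first classifies each
message with the current wordlists and registers only the ones the
filter gets wrong. A rough Python sketch of one such round (the
classify and register callables here are hypothetical stand-ins, not
bogofilter's actual interface):

    def train_on_error_round(messages, classify, register):
        # messages: iterable of (text, is_spam) pairs with known labels
        # classify: callable(text) -> True if judged spam (hypothetical)
        # register: callable(text, as_spam=...) that updates the wordlists
        #           (hypothetical stand-in for the real training call)
        errors = 0
        for text, is_spam in messages:
            if classify(text) != is_spam:
                register(text, as_spam=is_spam)
                errors += 1
        return errors   # repeat rounds until this stops shrinking

Repeating such rounds until the error count stops falling is the
"training to exhaustion" of the subject line; in the quoted runs the
returns diminished after the second round.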
Doesn't look as though pure training-on-error is particularly
advantageous with the Robinson-Fisher (chi) calculation method. It may
still be useful in maintaining the effectiveness of an established
training base.
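For reference, the Robinson-Fisher (chi) calculation combines the
individual token probabilities with Fisher's chi-squared technique:
-2 times the sum of the logs of the probabilities is referred to a
chi-squared distribution with 2n degrees of freedom, once in the
spamward direction and once in the hamward direction, and the two tail
probabilities are folded into a single indicator. A rough Python sketch
of that combining step (token probabilities assumed already computed
and clamped strictly between 0 and 1; this follows Gary Robinson's
published description and may differ in detail from bogofilter's code):

    import math

    def chi2Q(x2, v):
        # Tail probability of a chi-squared variate with v (even)
        # degrees of freedom: prob(chisq >= x2).
        m = x2 / 2.0
        total = term = math.exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            total += term
        return min(total, 1.0)

    def fisher_indicator(probs):
        # probs: per-token spam probabilities, strictly between 0 and 1
        n = len(probs)
        # S is near 1 for spammy messages, H near 1 for hammy ones
        S = chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
        H = chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
        return (1.0 + S - H) / 2.0   # 0 = ham, 1 = spam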
The above experiment is described more fully at
http://www.bgl.nu/~glouis/bogofilter/training.html
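One note on the numbers: meanerrpc, lcl95 and ucl95 are the mean error
percentage over the repeated runs and its lower and upper 95%
confidence limits. A minimal Python sketch of that calculation, with
made-up per-run values and a plain normal approximation (z = 1.96),
which may differ in detail from how the limits above were actually
obtained:

    import math

    # Illustrative per-run error percentages only, not the experimental data
    runs = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4, 5.1, 5.0, 5.2, 4.8]

    n = len(runs)
    mean = sum(runs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in runs) / (n - 1))
    halfwidth = 1.96 * sd / math.sqrt(n)   # half-width of the 95% CI
    print("meanerrpc %.2f  lcl95 %.2f  ucl95 %.2f"
          % (mean, mean - halfwidth, mean + halfwidth))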
--
| G r e g L o u i s        | gpg public key:    |
| http://www.bgl.nu/~glouis | finger greg@bgl.nu |