[Spambayes] train on error - to exhaustion?

Greg Louis glouis at dynamicro.on.ca
Tue Dec 3 12:04:36 2002


On 20021202 (Mon) at 1443:18 -0500, Bill Yerazunis wrote:
> 
>    2) train once on each error, but then repeat the whole training process
>    until all messages are classified correctly?
> 
>    I'd think the latter might be beneficial, but haven't tried it yet
>    myself.
> 
> Hmmm... that would be a good way to do regression checking to 
> verify that every message that is classified correctly once
> is classified correctly forevermore.

I have tried it now.  I started from scratch, with 6372 spams and 6372
nonspams, and did a single pass of training-on-error.  Then I did
second, third, fourth and fifth passes.  Here are the numbers of
messages that had to be trained on each pass:

  rounds spam good
1      1 1090  764
2      2  193   56
3      3   28   15
4      4   10    5
5      5    8    3
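
For anyone who wants to repeat this, the loop above can be sketched roughly as follows. This is only an illustration under stated assumptions: TinyBayes is a hypothetical toy word-count classifier standing in for bogofilter (it is not bogofilter's algorithm or API), and train_on_error, max_passes and the four-message corpus are names and data made up for the example.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy word-count classifier; a hypothetical stand-in for bogofilter."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}
        self.trained = {"spam": 0, "good": 0}

    def train(self, text, label):
        # Count each distinct word once per trained message.
        self.counts[label].update(set(text.split()))
        self.trained[label] += 1

    def classify(self, text):
        # Sum of smoothed log-likelihood ratios; a positive score means spam.
        score = 0.0
        for word in set(text.split()):
            p_spam = (self.counts["spam"][word] + 1) / (self.trained["spam"] + 2)
            p_good = (self.counts["good"][word] + 1) / (self.trained["good"] + 2)
            score += math.log(p_spam / p_good)
        return "spam" if score > 0 else "good"


def train_on_error(clf, corpus, max_passes=5):
    """Run train-on-error passes; return how many messages each pass trained."""
    history = []
    for _ in range(max_passes):
        trained = {"spam": 0, "good": 0}
        for text, label in corpus:
            # Train only on messages the current database gets wrong.
            if clf.classify(text) != label:
                clf.train(text, label)
                trained[label] += 1
        history.append(trained)
        if trained["spam"] == 0 and trained["good"] == 0:
            break  # a whole pass without errors: trained to exhaustion
    return history


# Made-up corpus, only to show the shape of the input.
corpus = [
    ("cheap pills now", "spam"),
    ("meeting at noon", "good"),
    ("cheap meeting pills", "spam"),
    ("lunch at noon", "good"),
]
clf = TinyBayes()
history = train_on_error(clf, corpus)
print(history)
```

Each entry of history corresponds to one row of the table above (messages trained per pass, by class). With max_passes=5 the loop stops after five passes even if errors remain, which is what happened in my run: 8 spams and 3 nonspams still needed training on the fifth pass.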

Then I took three files of 1624 nonspams each and three files of 617
spams each, giving three test runs (0, 1 and 2) of 2241 messages apiece,
and ran bogofilter on them with the training databases from each round
of training:

   round run fpos fneg err percent
1      1   0   22  126 148    6.60
2      1   1   17  123 140    6.25
3      1   2   19  121 140    6.25
4      2   0   23  105 128    5.71
5      2   1   18  113 131    5.85
6      2   2   22  109 131    5.85
7      3   0   23  104 127    5.67
8      3   1   18  111 129    5.76
9      3   2   22  108 130    5.80
10     4   0   23  104 127    5.67
11     4   1   18  111 129    5.76
12     4   2   22  108 130    5.80
13     5   0   23  103 126    5.62
14     5   1   19  108 127    5.67
15     5   2   22  107 129    5.76
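
The percent column is the total error count divided by the messages in one run (1624 nonspams plus 617 spams). A small sketch of that arithmetic, where the constant and function names are just illustrative:

```python
NONSPAM_PER_RUN = 1624  # nonspams in each test run
SPAM_PER_RUN = 617      # spams in each test run


def error_percent(fpos, fneg):
    """Errors (false positives plus false negatives) as a percentage of one run."""
    total_messages = NONSPAM_PER_RUN + SPAM_PER_RUN
    return round(100.0 * (fpos + fneg) / total_messages, 2)


# Round 1, run 0 from the table: 22 false positives, 126 false negatives.
print(error_percent(22, 126))
```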

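Summarizing, with the mean error percentage over the three runs of each
round and its lower and upper 95% confidence limits: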
  round meanerrpc lcl95 ucl95
1     1      6.37  6.13  6.60
2     2      5.80  5.56  6.04
3     3      5.74  5.50  5.98
4     4      5.74  5.50  5.98
5     5      5.68  5.44  5.92
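
The meanerrpc column can be checked from the percent column of the previous table, as in the sketch below. The dictionary and function names are illustrative only, and the sketch does not attempt the confidence limits, since how they were calculated is not described here.

```python
from statistics import mean

# Error percentages for runs 0, 1 and 2, copied from the per-run table.
percent_by_round = {
    1: [6.60, 6.25, 6.25],
    2: [5.71, 5.85, 5.85],
    3: [5.67, 5.76, 5.80],
    4: [5.67, 5.76, 5.80],
    5: [5.62, 5.67, 5.76],
}


def mean_error_percent(values):
    """Mean of the per-run error percentages, rounded to two decimals."""
    return round(mean(values), 2)


for round_number, values in percent_by_round.items():
    print(round_number, mean_error_percent(values))
```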

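It appears that a second round of training did improve discrimination
slightly (mean error fell from 6.37% to 5.80%), but after that the law
of diminishing returns set in: rounds three to five came in at 5.74%,
5.74% and 5.68%, all well inside the confidence limits for round two.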

What remains to be done is to start again from scratch and do a full
training, followed by one round of training-on-error, and run the test
data against those two training sets to see if the result is any
different.

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg@bgl.nu |


