[Spambayes] train on error - to exhaustion?

Tue Dec 3 17:23:25 2002

On 20021203 (Tue) at 1157:58 -0500, David Relson wrote:
> By definition, with training-on-error, only some of the 
> training corpora are put into the word lists.  The obvious result is 
> smaller word lists.

I can confirm that.  "twice" is the directory where the db files were
built by two rounds of train-on-error:

# ls -l full twice
full:
total 47288
-rw-r--r--    1 spamtest root     38936576 Dec  3 07:24 goodlist.db
-rw-r--r--    1 spamtest root      9424896 Dec  3 07:06 spamlist.db

twice:
total 22168
-rw-r--r--    1 spamtest users    15761408 Dec  2 14:54 goodlist.db
-rw-r--r--    1 spamtest users     6905856 Dec  2 14:55 spamlist.db
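
To put a number on that, here is a quick check of the totals.  It is just
arithmetic on the byte counts copied from the listing above; the variable
names are mine and nothing else is assumed:

```python
# Byte counts copied from the ls -l output above.
full = {"goodlist.db": 38936576, "spamlist.db": 9424896}
twice = {"goodlist.db": 15761408, "spamlist.db": 6905856}

full_total = sum(full.values())
twice_total = sum(twice.values())

# Fraction of the "full" word-list bulk that the "twice" lists occupy.
ratio = twice_total / full_total

print(f"full:  {full_total} bytes")
print(f"twice: {twice_total} bytes")
print(f"twice/full = {ratio:.3f}")
```

The ratio comes out a little under one half.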

> Other than list size, the effects are less clear.  On 
> the one hand, incoming messages will have fewer "hits" in the word lists; 
> while on the other hand, the hits will be more "meaningful".  With the 
> smaller lists, there is less "breadth of knowledge" about spam and 
> ham.  This could account for the lack of advantage of training-on-error.

That you get only half a percent more errors with wordlists less than
half the size does suggest that full training introduces a lot of
unproductive cruft, though.  What I _think_ I'm seeing is that, when
done on top of an existing "full" base, training on every error as
it's encountered does quickly improve the discrimination.  That's
gut feeling and could be wrong -- experimentation is needed.
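
For anyone who wants to experiment, here is a rough sketch of what
"train on error, to exhaustion" means as a loop.  It is a toy, not the
code of any real filter: ToyFilter, its crude count-comparing score, the
train_on_error helper and the four-message corpus are all made up for
illustration.  It only shows the control flow: classify each message,
train on it only if the answer was wrong, and repeat until a pass over
the corpus produces no errors.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a real word-list classifier."""

    def __init__(self):
        self.spam = Counter()  # token -> number of spam messages trained
        self.ham = Counter()   # token -> number of ham messages trained

    def train(self, tokens, is_spam):
        # Count each token once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def is_spam(self, tokens):
        # Crude vote (an assumption, not a real scoring method):
        # spam if the tokens were seen more often in spam than in ham.
        toks = set(tokens)
        s = sum(self.spam[t] for t in toks)
        h = sum(self.ham[t] for t in toks)
        return s > h  # ties, including "never seen", count as ham


def train_on_error(filt, corpus, max_rounds=10):
    """Train only on misclassified messages; repeat until a clean pass.

    Returns the number of messages trained in each round.
    """
    trained_per_round = []
    for _ in range(max_rounds):
        trained = 0
        for tokens, label in corpus:
            if filt.is_spam(tokens) != label:
                filt.train(tokens, label)
                trained += 1
        trained_per_round.append(trained)
        if trained == 0:  # exhaustion: nothing left to correct
            break
    return trained_per_round


# Made-up corpus of (tokens, is_spam) pairs.
corpus = [
    (["cheap", "pills", "now"], True),
    (["meeting", "agenda", "monday"], False),
    (["cheap", "flights", "monday"], False),
    (["pills", "offer", "now"], True),
]

filt = ToyFilter()
rounds = train_on_error(filt, corpus)
print(rounds)                         # messages trained per round
print(len(filt.spam), len(filt.ham))  # distinct tokens in each list
```

On this toy corpus only two of the four messages ever get trained, so
both lists end up holding fewer distinct tokens than full training would
give.  Whether the same holds for accuracy on real mail is exactly what
needs testing.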

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg@bgl.nu |