[Spambayes] train on error - to exhaustion?
Greg Louis
glouis at dynamicro.on.ca
Tue Dec 3 17:23:25 2002
On 20021203 (Tue) at 1157:58 -0500, David Relson wrote:
> By definition, with training-on-error, only some of the
> training corpora are put into the word lists. The obvious result is
> smaller word lists.
I can confirm that. "twice" is the directory where the db files were
built by two rounds of train-on-error:
# ls -l full twice
full:
total 47288
-rw-r--r-- 1 spamtest root 38936576 Dec 3 07:24 goodlist.db
-rw-r--r-- 1 spamtest root 9424896 Dec 3 07:06 spamlist.db
twice:
total 22168
-rw-r--r-- 1 spamtest users 15761408 Dec 2 14:54 goodlist.db
-rw-r--r-- 1 spamtest users 6905856 Dec 2 14:55 spamlist.db
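To put a number on "smaller", the byte sizes in the listing above can be summed and compared. This is only arithmetic on the figures shown; the dictionary names are illustrative, and file size is a rough proxy for the number of tokens stored, not an exact count.

```python
# Byte sizes copied from the ls -l listing above.
full = {"goodlist.db": 38936576, "spamlist.db": 9424896}
twice = {"goodlist.db": 15761408, "spamlist.db": 6905856}

full_total = sum(full.values())
twice_total = sum(twice.values())

# Fraction of the fully trained bulk that two rounds of train-on-error produce.
ratio = twice_total / full_total
print(f"full={full_total} bytes, twice={twice_total} bytes, ratio={ratio:.3f}")

# Per-file ratios show which list shrank more.
for name in full:
    print(f"{name}: {twice[name] / full[name]:.3f}")
```

The combined ratio comes out a little under one half, and the good list shrinks proportionally more than the spam list.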
> Other than list size, the effects are less clear. On
> the one hand, incoming messages will have fewer "hits" in the word lists;
> while on the other hand, the hits will be more "meaningful". With the
> smaller lists, there is less "breadth of knowledge" about spam and
> ham. This could account for the lack of advantage of training-on-error.
The fact that you get only half a percent more errors with less than
half the bulk of wordlists does suggest that full training introduces a
lot of unproductive cruft, though. What I _think_ I'm seeing is that,
when done on top of an existing "full" base, training on every error as
it's encountered does quickly improve the discrimination. That's
gut-feeling and could be wrong -- experimentation is needed.
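For readers unfamiliar with the procedure, here is a minimal sketch of train-on-error as discussed in this thread. Everything in it is an assumption made for illustration: ToyFilter, its crude voting rule, and train_on_error are hypothetical names, not the Spambayes API and not the code used for the tests above. The only point is the control flow: a message is added to the word lists only when the filter currently misclassifies it, and the pass over the corpus is repeated for a fixed number of rounds.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a real filter, with one Counter per word list."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        # Count each distinct token once per message.
        (self.spam if is_spam else self.ham).update(set(tokens))

    def is_spam(self, tokens):
        # Crude vote: each token sides with whichever list has seen it more.
        votes = sum(
            (self.spam[t] > self.ham[t]) - (self.ham[t] > self.spam[t])
            for t in set(tokens)
        )
        return votes > 0


def train_on_error(filt, corpus, rounds=2):
    """Train only on messages the filter currently gets wrong.

    corpus is a list of (tokens, is_spam) pairs; returns the number of
    messages that were actually trained on.
    """
    trained = 0
    for _ in range(rounds):
        for tokens, label in corpus:
            if filt.is_spam(tokens) != label:
                filt.train(tokens, label)
                trained += 1
    return trained
```

Starting from an existing "full" base, as described above, would amount to calling train on every message first and only then running train_on_error on new mail.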
--
| G r e g L o u i s | gpg public key: |
| http://www.bgl.nu/~glouis | finger greg@bgl.nu |