[Spambayes] train on error - to exhaustion?

Greg Louis glouis at dynamicro.on.ca
Tue Dec 3 17:11:10 2002


On 20021203 (Tue) at 1153:10 -0500, Tim Peters wrote:
> [Greg Louis]
> > ...
> > Doesn't look as though pure training-on-error is particularly
> > advantageous with the Robinson-Fisher (chi) calculation method.
> 
> Are you hashing tokens?  spambayes does not, CRM114 does.  Bill generates
> about 16 hash codes per input token, and with just a million hash buckets,
> collision rates zoom quickly if you train on everything.

Understood.  We don't hash tokens, and I agree that the sentence you
quoted is misleading; I should have said something like "bogofilter's
current tokenization and the R-F classification method."  I didn't try
any of bogofilter's other calculation methods.

> The experiments spambayes did with CRM114-like schemes were a
> disaster due to this -- we continued to train on everything, with
> hashing but without any bounds on bucket count, and the hash
> collisions quickly caused outrageously bad classification mistakes. 
> Removing the hashing cured that, but then the database size goes
> through the roof (when generating ~16 "exact strings" per input
> token, and training on everything).

Yup.
 
> Training-on-error helps Bill because it slashes hash collisions, simply via
> producing far fewer hash codes than does training on everything.

I didn't mean to imply otherwise, and your correction of my sloppy
wording is appreciated.
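
To put rough numbers on the collision argument above, here is a minimal
sketch. It assumes an idealized uniform hash, which is not a description
of CRM114's actual hashing, and the function name and the feature counts
in the example are made up for illustration. Only the million-bucket
table size comes from the discussion above.

```python
import math


def occupied_fraction(distinct_features: int, buckets: int) -> float:
    """Expected fraction of hash buckets holding at least one feature.

    Assumes each distinct feature lands in a uniformly random bucket.
    This fraction is also the chance that a brand-new feature collides
    with something already stored.
    """
    if buckets <= 0:
        raise ValueError("buckets must be positive")
    # 1 - exp(-n/m), written with expm1 for accuracy when n/m is small.
    return -math.expm1(-distinct_features / buckets)


if __name__ == "__main__":
    buckets = 1_000_000  # table size mentioned in the thread
    # Hypothetical distinct-feature counts, for illustration only:
    # a small set (training on errors) vs. a large one (training on everything).
    for label, features in [("train on error", 500_000),
                            ("train on everything", 3_000_000)]:
        share = occupied_fraction(features, buckets)
        print(f"{label}: {share:.1%} of buckets occupied")
```

Under that assumption, the collision chance depends only on the ratio of
distinct features to buckets, so anything that trains on fewer messages
(and so generates fewer features) lowers it directly.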

> Experiments in the default non-hashing spambayes unigram code found that
> train-on-error hurt the unsure rate but not the FP or FN rates.
> 
> > It may still be useful in maintaining the effectiveness of an established
> > training base.
> 
> Possibly; we didn't do any experiments on that.

Neither have I.  I've been doing it in practice and it seems to work (my
fp/fn rates are coming down), but I would like to perform a
properly designed experiment to assess it.
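
For readers unfamiliar with the procedure, a minimal sketch of
train-on-error repeated "to exhaustion" (one reading of the subject line)
might look like the following. The ToyClassifier is a hypothetical
stand-in that just compares token counts; it is not bogofilter's
tokenizer or the Robinson-Fisher calculation, and the max_passes cap is
an assumption added so the loop always terminates.

```python
from collections import Counter
from typing import List, Tuple

Message = Tuple[str, bool]  # (text, is_spam)


class ToyClassifier:
    """Hypothetical stand-in: scores a message by per-class token counts."""

    def __init__(self) -> None:
        self.spam: Counter = Counter()
        self.ham: Counter = Counter()

    def train(self, text: str, is_spam: bool) -> None:
        target = self.spam if is_spam else self.ham
        target.update(text.lower().split())

    def is_spam(self, text: str) -> bool:
        tokens = text.lower().split()
        spam_score = sum(self.spam[t] for t in tokens)
        ham_score = sum(self.ham[t] for t in tokens)
        return spam_score > ham_score  # ties count as ham


def train_on_error(clf: ToyClassifier, messages: List[Message],
                   max_passes: int = 10) -> int:
    """Train only on misclassified messages, repeating passes over the
    corpus until one pass produces no errors or max_passes is reached.
    Returns the total number of training events."""
    trained = 0
    for _ in range(max_passes):
        errors = 0
        for text, label in messages:
            if clf.is_spam(text) != label:
                clf.train(text, label)
                errors += 1
        trained += errors
        if errors == 0:
            break
    return trained
```

The return value is the number of messages actually trained on, which is
the quantity that matters for database size; a proper experiment would
also score a held-out set rather than the training messages themselves.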

-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg@bgl.nu |


