[spambayes-dev] RE: [Spambayes] How low can you go?

Tue Dec 30 11:32:18 EST 2003

> [T. Alexander Popiel]
> One thing that's occurred to me is to have the training cutoffs at
> N sigma from mean (where N == .5?) for the two populations; how you'd
> bootstrap that is an open question, of course.

Great idea.  The first pass could simply use two constant thresholds; after
that, compute the mean and SD of each population and derive new thresholds
from them.  This should converge fairly quickly.
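
A rough sketch of that bootstrap, in Python.  The function name, the
starting cutoffs of 0.2 and 0.8, and the min_samples guard are placeholders
of mine, not anything in SpamBayes.  It also assumes scores run from 0 (ham)
to 1 (spam), and that "N sigma from mean" means the side of each mean facing
the other population.

```python
from statistics import mean, pstdev

def update_cutoffs(ham_scores, spam_scores, n_sigma=0.5,
                   initial=(0.2, 0.8), min_samples=2):
    """Return (ham_cutoff, spam_cutoff) for deciding what to train on.

    Hypothetical helper: until each population has min_samples scores,
    fall back to the constant starting cutoffs.
    """
    if len(ham_scores) < min_samples or len(spam_scores) < min_samples:
        return initial
    # Ham scoring above its cutoff, or spam scoring below its cutoff,
    # would be candidates for training under this reading.
    ham_cutoff = mean(ham_scores) + n_sigma * pstdev(ham_scores)
    spam_cutoff = mean(spam_scores) - n_sigma * pstdev(spam_scores)
    return ham_cutoff, spam_cutoff
```

Calling this again after each batch, with the scores seen so far, gives the
"start constant, then recompute" behaviour described above.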

Another idea is to use the two means, but decide how many SDs to go from
each one based on the incoming ham/spam ratio.  This requires you to make an
assumption about the distributions.
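
One way this might look, as a sketch only: assume (and this is a strong
assumption) that each population's scores are roughly normal, pick a base
number of SDs for the rarer class, and widen the cutoff for the more common
class until the expected number trained from each class is the same.  The
function name and base_sigma default are made up for illustration, and
statistics.NormalDist needs Python 3.8 or later.

```python
from statistics import NormalDist

def sigmas_for_ratio(n_ham, n_spam, base_sigma=0.5):
    """Return (k_ham, k_spam): how many SDs to go from each mean.

    Assumes normally distributed scores in each population and
    n_ham, n_spam > 0.  The rarer class keeps base_sigma; the more
    common class gets a larger k so fewer of its messages qualify.
    """
    std = NormalDist()                      # standard normal, mean 0, SD 1
    tail = 1.0 - std.cdf(base_sigma)        # fraction beyond base_sigma
    target = min(n_ham, n_spam) * tail      # expected count to train per class
    k_ham = std.inv_cdf(1.0 - target / n_ham)
    k_spam = std.inv_cdf(1.0 - target / n_spam)
    return k_ham, k_spam
```

With equal volumes both values come back as base_sigma; with four times as
much spam as ham, the spam cutoff moves further from the spam mean.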

Along the same lines, one more possibility is to construct a cumulative
distribution function (CDF) of new mail received, then set the training
thresholds such that you would train on an equal number of ham and spam.
This also lets you set the total number of messages trained, or at least
limit it to a maximum value.  Since this is a batch (nightly?) process rather
than continuous, the CDF is computed a posteriori, so both the ratio and the
number of newly trained messages will be achieved exactly.
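
A minimal sketch of the batch version, using the empirical CDF of scores
(i.e. just sorting them).  It assumes the batch has already been sorted into
ham and spam, that the CDF is taken over message scores, and that the
messages worth training on are the highest-scoring ham and the
lowest-scoring spam.  The function name and max_total are placeholders.

```python
def batch_training_sets(ham_scores, spam_scores, max_total=100):
    """Pick equal numbers of ham and spam scores to train on.

    Returns (ham_pick, spam_pick, ham_cutoff, spam_cutoff).  Selecting
    by rank rather than by threshold keeps the counts exact even when
    several messages tie at the cutoff score.
    """
    # Same count per class, capped by the smaller class and the budget.
    k = min(len(ham_scores), len(spam_scores), max_total // 2)
    ham_pick = sorted(ham_scores, reverse=True)[:k]   # highest-scoring ham
    spam_pick = sorted(spam_scores)[:k]               # lowest-scoring spam
    # The implied thresholds are the scores of the last messages picked.
    ham_cutoff = ham_pick[-1] if k else None
    spam_cutoff = spam_pick[-1] if k else None
    return ham_pick, spam_pick, ham_cutoff, spam_cutoff
```

For example, with ham scores [0.1, 0.3, 0.05, 0.2], spam scores
[0.9, 0.6, 0.99] and max_total=4, two messages are picked from each class
and the implied cutoffs are 0.2 for ham and 0.9 for spam.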

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above