[Spambayes] Quick question

Jeff Epler jepler at unpythonic.net
Tue Jul 6 17:03:12 CEST 2004


On Tue, Jul 06, 2004 at 01:13:41PM +1000, David Loh wrote:
> Sorry, I am sure this has already been raised but I can't find it in the FAQ. 
> 
> Could someone tell me if there is an optimal number of ham and spam on which to train Spambayes?

I initially trained on about 100 spam and 100 ham.  Since then, I've
mostly followed the "train on error and unsure" methodology, which has
led to a 2:1 imbalance in favor of spam in my database.
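
For anyone unfamiliar with the idea, here is a minimal sketch of the
"train on error and unsure" policy.  It is not the SpamBayes API:
ToyClassifier, classify() and train_on_error_and_unsure() are hypothetical
names made up for illustration, the scoring is a crude stand-in for the real
combining scheme, and the 0.20 / 0.90 cutoffs are assumed values.  The point
is only the selection rule: a message goes into the database only when the
filter got it wrong or could not decide.

```python
from collections import Counter

# Assumed cutoffs for this sketch; scores between them count as "unsure".
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


class ToyClassifier:
    """Stand-in classifier (hypothetical, not the SpamBayes API)."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.nspam = 0
        self.nham = 0

    def train(self, text, is_spam):
        # Count each distinct token once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.nspam += 1
        else:
            self.ham_tokens.update(tokens)
            self.nham += 1

    def score(self, text):
        """Return a crude spam score in [0, 1]; 0.5 when there is no evidence."""
        probs = []
        for tok in set(text.lower().split()):
            s = self.spam_tokens[tok] / self.nspam if self.nspam else 0.0
            h = self.ham_tokens[tok] / self.nham if self.nham else 0.0
            if s + h > 0:
                probs.append(s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5


def classify(score):
    """Map a score to 'spam', 'ham' or 'unsure' using the assumed cutoffs."""
    if score >= SPAM_CUTOFF:
        return "spam"
    if score <= HAM_CUTOFF:
        return "ham"
    return "unsure"


def train_on_error_and_unsure(clf, text, is_spam):
    """Train only on mistakes and unsures; return True if the message was trained."""
    verdict = classify(clf.score(text))
    correct = "spam" if is_spam else "ham"
    if verdict == correct:
        return False  # the filter already gets this one right, so skip it
    clf.train(text, is_spam)
    return True
```

Because correctly classified mail is never added, the database grows only
with whichever kind of message the filter struggles with, which is one way
an imbalance like mine can arise.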

The SpamBayes wiki has a page comparing training methods.  TOErrors
(train on errors) is easy to do, and has given good results for me.  TTE
(train to exhaustion) has given good results for others, but is more
difficult to do.  TOEverything (train on everything) is probably a bad
idea for almost everybody.
    <http://www.entrian.com/sbwiki/TrainingIdeas>
If I had to start my database anew, I would be tempted to use TTE or
"non-edge" training, but I'd be most likely to use the same training
method because it's given me pretty good results.

> I have heard that if you exceed a couple of thousand training messages, the efficiency of the filtering goes down. Is this true?

I now have 1502 ham and 2566 spam in my database, and it is still
effective enough for me.  However, the number of messages marked as
"unsure" may be increasing slowly over time.

Here's another message I wrote recently talking about my results using
spambayes:
http://mail.python.org/pipermail/spambayes/2004-June/013546.html

Jeff