[Spambayes] How many is enough?

Meyer, Tony T.A.Meyer at massey.ac.nz
Mon May 12 14:41:45 EDT 2003


> I've read the pages at http://spambayes.sourceforge.net/ now 
> and concluded that you should train your database, but not too much.
> What I fail to find is some numbers for this. Are we talking 
> about hundreds or thousands or millions?
> I've trained my database with 3000 ham and only 50 spam. That 
> was basically all I had available in my email client at the moment.
> So, how much should I train before I run the risk of overdoing it?

The more you train, the better, in general.  However:
  * If you have many more ham than spam (or many more spam than ham),
    this can be bad.  If you enable the experimental_ham_spam_imbalance
    option, this shouldn't matter as much, although that option hasn't
    been tested as much as it could be.
  * 50 spam is fairly low.  It wouldn't be that surprising to get some
    incorrect results with that few, but it should still do a reasonable
    job.

From my experience, I would say that you should have a couple of hundred
of each to get acceptable results (I only use a corpus of about 400 each
in Outlook).  If you have thousands, then you'll probably get more
accurate results, but I imagine that the utility of adding another
message to the corpus gets lower as the corpus gets bigger.
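
As a rough sketch of the rules of thumb above, here is a small Python
check on the message counts.  This is not part of Spambayes: the function
name, the min_each default of 200 (my "couple of hundred" figure) and the
max_ratio cutoff of 5 are illustrative assumptions, not tested values.

```python
def assess_training(n_ham, n_spam, min_each=200, max_ratio=5.0):
    """Return a list of warnings about the size of a training corpus.

    Hypothetical helper: min_each and max_ratio are assumed
    rule-of-thumb values, not settings taken from Spambayes.
    """
    warnings = []
    # Too few messages of either kind makes mistakes more likely.
    if n_ham < min_each:
        warnings.append(f"only {n_ham} ham; aim for about {min_each} or more")
    if n_spam < min_each:
        warnings.append(f"only {n_spam} spam; aim for about {min_each} or more")
    # A large imbalance between the two kinds can also hurt results.
    smaller, larger = min(n_ham, n_spam), max(n_ham, n_spam)
    if smaller == 0 or larger / smaller > max_ratio:
        warnings.append("ham and spam counts are heavily imbalanced")
    return warnings


# The counts from the question: 3000 ham and 50 spam.
for warning in assess_training(3000, 50):
    print(warning)
```

With the counts from the question, this flags both the low spam count and
the imbalance; a corpus of about 400 each passes without warnings.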

IIRC, if you look through the archives there's a post that has a
reference to a webpage that has graphs for different corpus sizes.  That
might be of interest.

=Tony Meyer


