[Spambayes] How many is enough?
Meyer, Tony
T.A.Meyer at massey.ac.nz
Mon May 12 14:41:45 EDT 2003
> I've read the pages at http://spambayes.sourceforge.net/ now
> and concluded that you should train your database, but not too much.
> What I fail to find is some numbers for this. Are we talking
> about hundreds or thousands or millions?
> I've trained my database with 3000 ham and only 50 spam. That
> was basically all I had available in my email client at the moment.
> So, how much should I train before I run the risk of overdoing it?
The more you train, the better, in general. However:
* If you have many more ham than spam (or vice versa), this can be bad.
(However, if you enable the experimental_ham_spam_imbalance option,
this shouldn't matter as much, although the option hasn't been tested
as thoroughly as it could be.)
* 50 spam is fairly low. It wouldn't be that surprising to get some
incorrect results with that few, but it should still do a reasonable
job.
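To make the imbalance point concrete, here is a rough sketch of a
Robinson-style per-token score. It is not the actual SpamBayes code; the
function name and the constants s=0.45 and x=0.5 are assumptions chosen
for illustration. It shows how a word seen once in ham and once in spam
scores with a balanced corpus and with a 3000/50 one.

```python
def token_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Smoothed spam probability for one token (illustrative only).

    spam_count / ham_count: trained messages of each kind containing the token.
    nspam / nham: total trained spam and ham messages.
    s, x: assumed prior strength and prior probability for unseen words.
    """
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    if spam_ratio + ham_ratio == 0:
        return x  # token never seen: fall back to the prior
    # Counts are normalised by corpus size, so one hit in a small
    # spam corpus weighs far more than one hit in a large ham corpus.
    raw = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    # Pull the raw estimate toward the prior when evidence is thin.
    return (s * x + n * raw) / (s + n)


balanced = token_spam_prob(1, 1, nspam=400, nham=400)
skewed = token_spam_prob(1, 1, nspam=50, nham=3000)
print(balanced, skewed)
```

With 400 of each, the word comes out neutral (0.5). With 3000 ham and 50
spam, the same word scores roughly 0.89, i.e. it looks quite spammy purely
because of the imbalance.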
From my experience, I would say that you should have a couple of hundred
of each to get acceptable results (I only use a corpus of about 400 each
in Outlook). If you have thousands, then you'll probably get more
accurate results, but I imagine that the utility of adding another
message to the corpus gets lower as the corpus gets bigger.
IIRC, if you look through the archives there's a post that has a
reference to a webpage that has graphs for different corpus sizes. That
might be of interest.
=Tony Meyer