[Spambayes] Some details that could be better

Tim Peters tim.peters at gmail.com
Sat Aug 21 09:38:39 CEST 2004


[Lauri Harpf]
> ...
> Does the algorithm rely heavily on obtaining approximately a 50/50
> ratio?

In theory, the algorithm couldn't care less.  In practice it matters.
That's an empirical observation, first suspected via anecdote and
later confirmed via testing:  the worse the imbalance, the worse the
results, across several distinct test corpora.

We had an option once to try to do better when imbalance was large,
but it turned out to create worse problems than it solved, so that
code was thrown out.

Thought experiment (which inspired the counterproductive option
mentioned above):  suppose you trained on 1000 spam and 0 ham.  What
then?  Every token in the database would look 100% spammy, and it
would be impossible for any message to score below 50% (and a new
message could get as low as 50% only by using tokens that had never
been seen before).
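
To make the thought experiment concrete, here is a minimal sketch of a
per-token spam probability.  It assumes a Robinson-style smoothed
estimate with a neutral prior x = 0.5 and a prior strength s = 0.45;
the function name, the parameter defaults, and the divide-by-zero
guard are illustrative assumptions, not a copy of the SpamBayes code,
and it covers only token probabilities, not how a whole message is
scored.

```python
def token_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Smoothed estimate that a token indicates spam (illustrative only).

    spam_count / ham_count: number of trained spam / ham containing the token.
    nspam / nham: total number of spam / ham trained on.
    s, x: assumed prior strength and neutral prior for unseen tokens.
    """
    n = spam_count + ham_count
    if n == 0:
        # Never-seen token: nothing to go on, so return the neutral prior.
        return x
    # Assumed guard: treat an empty class as size 1 to avoid dividing by zero.
    spam_ratio = spam_count / max(nspam, 1)
    ham_ratio = ham_count / max(nham, 1)
    p = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate toward the prior; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)


# 1000 spam, 0 ham: every token that was seen at all leans spammy.
for count in (1, 10, 500):
    print(count, round(token_spam_prob(count, 0, 1000, 0), 3))
```

Under these assumptions no token can come out below 0.5 when nham is
0:  a token that was seen has a raw estimate of 1.0, and a token that
was never seen sits at the prior.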

Add 1 ham to that, and it obviously can't get much better.  Or 2, or
3.  How much is enough?  There's no analytical answer we know of.

> According to the most recent survey I have seen, about 80-90% of
> all E-mail traffic on the Internet is spam. Thus, it is quite difficult to
> get even numbers of ham and spam, ...

Training ratio is a question of what you choose to train on, not a
question of the ratio you receive.  I expect that most people could do
far less training and still get excellent results (indeed, many would
get *better* results if they trained on less, especially if that
improved a badly out-of-whack balance).
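
One way to picture "what you choose to train on" is to cap the larger
class before training.  The helper below is a hypothetical sketch, not
a SpamBayes API:  the name balanced_training_sample, the max_ratio
parameter, and the choice of uniform random sampling are all
assumptions made for illustration.

```python
import random


def balanced_training_sample(ham, spam, max_ratio=1.0, seed=None):
    """Return (ham_subset, spam_subset) for training (hypothetical helper).

    The larger class is randomly cut down to at most max_ratio times the
    size of the smaller class.  max_ratio is assumed to be >= 1.0, so the
    smaller class is always kept whole.
    """
    rng = random.Random(seed)
    smaller = min(len(ham), len(spam))
    cap = int(smaller * max_ratio)

    def pick(msgs):
        msgs = list(msgs)
        # Keep everything if already within the cap, else sample down to it.
        return msgs if len(msgs) <= cap else rng.sample(msgs, cap)

    return pick(ham), pick(spam)


# Example: 100 ham and 900 spam received, but train on an even split.
ham_msgs = ["ham %d" % i for i in range(100)]
spam_msgs = ["spam %d" % i for i in range(900)]
train_ham, train_spam = balanced_training_sample(ham_msgs, spam_msgs, seed=0)
print(len(train_ham), len(train_spam))
```

Whether an even split or some milder cap works best is an empirical
question; the sketch only shows that the training ratio can be set
independently of the ratio of mail received.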

Note that there's a lot of info about training strategies on the
SpamBayes Wiki, starting here:

   http://www.entrian.com/sbwiki/TrainingIdeas

