[Spambayes] Problem with POP3 Proxy: Complains about ham/spam ratio

Mon Aug 9 07:53:40 CEST 2004

> I have trained 984 ham and 3201 spam.
[...]
> I've trained a total of around 12000 emails

Your database thinks that you've trained on exactly 4185 emails (984 ham +
3201 spam).  If that isn't right, then there might be a problem with the
database (one that may manifest later).

> reviewing both the spam and ham and its dead on
> but to prevent the warning about the ratio I've been just discarding
> the spam/unsure and only training on what it thinks is ham.

There's a lot about training that's unknown.  There are some notes about it
on the wiki <http://entrian.com/sbwiki>, which you might want to read.  One
system that's typically good and reasonably simple to do with sb_server is
training on mistakes: you train only on unsures and any false
positives/false negatives.
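
To make "training on mistakes" concrete, here is a minimal sketch.  It is
not SpamBayes code: score() and train() are hypothetical callables standing
in for whatever classifier you use, and the 0.20/0.90 cutoffs and the
handling of scores exactly on a cutoff are assumptions for illustration.

```python
def train_on_mistakes(messages, score, train,
                      ham_cutoff=0.20, spam_cutoff=0.90):
    """Train only on unsures, false positives and false negatives.

    messages: iterable of (text, is_spam) pairs carrying the correct label.
    score:    hypothetical callable, score(text) -> spam probability in [0, 1].
    train:    hypothetical callable, train(text, is_spam) -> None.
    Returns the number of messages that were trained on.
    """
    trained = 0
    for text, is_spam in messages:
        prob = score(text)
        if prob < ham_cutoff:
            verdict = "ham"
        elif prob > spam_cutoff:
            verdict = "spam"
        else:
            verdict = "unsure"
        expected = "spam" if is_spam else "ham"
        if verdict != expected:  # unsure, false positive or false negative
            train(text, is_spam)
            trained += 1
    return trained
```

Mail that was classified correctly is never passed to train(), so the
database only grows when the classifier got something wrong or couldn't
decide.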

Ideally, only about 2% of your mail should be unsure (and no fp's, and
almost no fn's).  If this isn't the case, then adjusting the thresholds
might be a viable option.
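
As a rough illustration of what adjusting the thresholds does, the sketch
below counts how much mail lands in the unsure band for two pairs of
cutoffs.  The scores are made up, and the parameter names ham_cutoff and
spam_cutoff and their values are assumptions chosen for the example, so
check your own configuration rather than trusting these numbers.

```python
def unsure_fraction(scores, ham_cutoff, spam_cutoff):
    """Return the fraction of scores that fall in the unsure band."""
    if not scores:
        return 0.0
    unsure = sum(1 for p in scores if ham_cutoff <= p <= spam_cutoff)
    return unsure / len(scores)


# Made-up spam probabilities for eight messages.
scores = [0.01, 0.02, 0.15, 0.35, 0.85, 0.97, 0.99, 1.00]

print(unsure_fraction(scores, 0.20, 0.90))  # 0.25 (wide unsure band)
print(unsure_fraction(scores, 0.40, 0.80))  # 0.0  (narrow unsure band)
```

Narrowing the band reduces the number of unsures, but a message that used
to be unsure is now called ham or spam, so it is worth watching for new
false positives and false negatives after any change.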

The mail that is unsure, of course, may not be 50/50 split between ham and
spam, so there'll still be imbalance.  It should take a while before any
significant imbalance is reached, however, and it's easy enough to then
train more ham, if you want to.  Or, if the system is generally going well,
you can discard just about all mail without training.

> Is the 2:1
> ratio in the FAQ just a recommendation or is there a programic reason
> not to exceed it?

It's not a programmatic reason, but a mathematical one.  The statistics that
underlie the calculations give the most sensible (for this task) results
when the amounts of ham and spam training data are approximately equal.  The
exact ratio where things start to go bad will be different for every set of
training & testing data - the ratios that trigger the warnings are guesses
based on testing & feedback to this list.  You are free to ignore the
warnings - and if it's working for you, then please do so!
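
For what it's worth, a ratio check of this kind can be sketched in a few
lines.  This is not the code that produces the warning: is_imbalanced() and
its max_ratio parameter are hypothetical, and 2.0 is used only because that
is the figure quoted from the FAQ.

```python
def is_imbalanced(nham, nspam, max_ratio=2.0):
    """Return True if one class outnumbers the other by more than max_ratio.

    max_ratio is a hypothetical tuning knob, not a SpamBayes option.
    """
    if nham == nspam:
        return False
    if min(nham, nspam) == 0:
        return True  # one class has no training data at all
    return max(nham, nspam) / min(nham, nspam) > max_ratio


# The figures from this thread: 984 ham and 3201 spam, about 3.25:1.
print(is_imbalanced(984, 3201))  # True
```

With 984 ham you would need to get down to 1968 spam or fewer (or train
more ham) before a 2:1 check like this one stopped complaining.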

There was once an option that could be turned on to try and compensate for
imbalance.  This was eventually found to cause more problems than it fixed,
though, so it has been removed.  If someone (a statistician, probably) can
come up with a new way to do that, then we would happily give it a go.

Another solution is to use a training regime that enforces balance - like
"train to exhaustion" (tte).  There's a tte.py script in the source
distribution that does this for you, but it's not designed for use with
sb_server.  One of the things that I'd like to look at over the next wee
while (as I get time) is integration of tte with sb_server - not only for
the promise of implicit balance, but because the results are meant to be
better, too.
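
The general idea of train to exhaustion can be sketched as below.  This is
a simplified illustration and not a copy of tte.py: score() and train() are
hypothetical callables, the cutoffs and max_rounds are assumed values, and
presenting ham and spam in pairs is just one way to keep the two piles
roughly even.

```python
def train_to_exhaustion(hams, spams, score, train,
                        ham_cutoff=0.20, spam_cutoff=0.90, max_rounds=10):
    """Repeatedly train on misclassified messages until none remain.

    hams, spams:  lists of messages; zip() drops the surplus from the
                  longer list, so ham and spam are always seen in pairs.
    score, train: hypothetical callables, score(msg) -> spam probability
                  and train(msg, is_spam) -> None.
    Returns the number of rounds that were run.
    """
    pairs = list(zip(hams, spams))
    for round_no in range(1, max_rounds + 1):
        mistakes = 0
        for ham, spam in pairs:
            if score(ham) >= ham_cutoff:    # ham not confidently ham
                train(ham, False)
                mistakes += 1
            if score(spam) <= spam_cutoff:  # spam not confidently spam
                train(spam, True)
                mistakes += 1
        if mistakes == 0:  # everything scored properly, so stop
            return round_no
    return max_rounds
```

Each round only trains on the messages that still score wrongly, and the
loop stops as soon as a full pass produces no mistakes (or max_rounds is
reached).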

=Tony Meyer

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.