[Spambayes] Spam to Ham ratio

Wed Feb 11 20:26:33 EST 2004

> I currently have 139 Good and 286 Spam trained.
> I get about 10x more spam than ham. I find that my
> ham is solidly classified at 0-1% while spambayes
> still misses some spam at numbers like 83% (and some
> at 0%).

Are the ones getting 0% a result of the Outlook plug-in bug that does that?
IOW, if you look at the clues for one of the 0% messages, is it actually
scoring 0%?  (If it is, then that's quite strange.)  If that's the case,
then it seems that a simple solution would be to move your spam threshold
down to 80%, rather than the default 90%.  (This assumes that you don't
ever see any unsures that score above 80%.)
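
To make the effect of the threshold concrete, here is a minimal sketch of
how a score maps to a folder.  It is not the actual SpamBayes code: the
function name classify() is made up for illustration, and the 20% ham
cutoff is an assumption (only the 90% spam default is mentioned above).

```python
# Hypothetical helper, not part of SpamBayes.  Scores are in [0.0, 1.0].
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a message score to 'ham', 'unsure' or 'spam'."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"

# A spam scoring 83% is only "unsure" with the default spam cutoff ...
print(classify(0.83))                    # unsure
# ... but is filed as spam once the cutoff is lowered to 80%.
print(classify(0.83, spam_cutoff=0.80))  # spam
```

The catch is the same one noted above: any ham that scores between 80% and
90% would then be filed as spam too.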

> These are the spam messages with lots of random words
> thrown in to try to defeat the statistical filters.

Have you looked at the clues for any of these?  It seems likely (and many
people have found) that the random words won't do anything to help move the
score towards ham.  A random word is most likely to be unknown to your
filter, so won't be used, and if it is known, it has about as much chance of
being a spam clue as a ham one.  (Unless the words aren't random, and are
tailored to you personally.)  Looking at the spam clues would tell you if it
is actually the random words that are making the difference.
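
As a rough illustration of why unknown words don't matter, here is a sketch
of clue selection.  This is not the real SpamBayes tokenizer or classifier:
pick_clues(), the word_probs table, and the 0.1 minimum strength are all
assumptions made up for the example.

```python
# Hypothetical sketch, not SpamBayes code.
# word_probs maps a trained token to its spam probability (0.0 to 1.0).
def pick_clues(tokens, word_probs, min_strength=0.1):
    """Return (token, prob) pairs that would count as clues."""
    clues = []
    for tok in set(tokens):
        prob = word_probs.get(tok)
        if prob is None:
            # Never seen in training: the token is simply ignored.
            continue
        if abs(prob - 0.5) >= min_strength:
            # Far enough from neutral (0.5) to count as a clue.
            clues.append((tok, prob))
    # Strongest clues (furthest from 0.5) first.
    return sorted(clues, key=lambda c: abs(c[1] - 0.5), reverse=True)

# Made-up training data for the example.
word_probs = {"viagra": 0.99, "meeting": 0.02, "the": 0.5}
message = ["viagra", "zyxgloop", "quixotic", "the"]
print(pick_clues(message, word_probs))  # [('viagra', 0.99)]
```

In this toy example the two random words are unknown and contribute
nothing, so the score is driven entirely by the one genuine spam clue.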

> Anyway it seems to me that with my ham being recognized
> so perfectly while the spam is less than perfect that
> I would need to classify more spam, further deviating
> from the recommended 1:1 ratio. 
> Or do you think the recognition would work better if I 
> increased my ham messages (even tho they are all coming
> in with 0%)?

Try both, and see what happens.  Most (but not all) of our testing has shown
that an imbalance hurts, although that usually means a big imbalance, not a
2:1 sort of thing (which might even help).  Your mail mix is unique to you,
though, so the only way to know for sure is to try it out.

> In anycase of 690 spam I got in the last two days I
> only have to delete as spam 18 of them. Not bad.

[I presume that these have all been unsures, rather than false-negatives.]
An unsure rate of 2.6% is pretty good - this isn't all that different from
the rate seen in lots of the testing.  If you can cut even half of these
by lowering the threshold to 80%, then that's probably as good as it's going
to get, without changing the code itself.
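
For reference, the arithmetic behind those figures, using the counts quoted
above (the variable names are just for illustration):

```python
spam_received = 690   # spam over the last two days
unsure = 18           # messages that had to be deleted by hand

unsure_rate = unsure / spam_received
print(f"{unsure_rate:.1%}")   # 2.6%

# If lowering the threshold to 80% catches half of them:
halved_rate = (unsure // 2) / spam_received
print(f"{halved_rate:.1%}")   # 1.3%
```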

=Tony Meyer

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.