[Spambayes] Spam to Ham ratio
nowhere at cox.net
Thu Feb 12 01:42:22 EST 2004
Thanks for the input. I will check the next 0% scored spam I get to see
what the clues are in reality. I was unaware of a bug. Will it manifest
as showing the outlook plugin score as 0% but inside the clues it is
As for the threshold, I had already moved it down to 85%, and it has
caught almost 100% of the spam. Only an occasional group of unsures
comes through, plus the occasional 0% spam (which is almost always
identical to a spam I have already classified, but with randomly placed
punctuation or words).
And yes, of the 18 "missed spam" all but one were unsure. The last
one was that 0% "bug" we are talking about.
I will check a little closer over the next couple of days, but like I said, I
am VERY pleased with the results. Having to look at only 2% of the spam
is fine by me!
Thanks for helping!
From: Tony Meyer [mailto:tameyer at ihug.co.nz]
Sent: Wednesday, February 11, 2004 6:27 PM
To: 'Nowhere'; spambayes at python.org
Subject: RE: [Spambayes] Spam to Ham ratio
> I currently have 139 Good and 286 Spam trained.
> I get about 10x more spam than ham. I find that my
> ham is solidly classified at 0-1% while spambayes
> still misses some spam at numbers like 83% (and some
> at 0%).
Are the ones getting 0% a result of the Outlook plug-in bug that does
IOW, if you look at the clues for one of the 0% messages, is it actually
scoring 0%? (If it is, then that's quite strange). If that's the case,
then it seems that a simple solution would be to move your spam
threshold down to 80%, rather than the default 90%. (This assumes that
you don't ever see any unsures that score above 80%).
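
To make the effect of moving the threshold concrete, here is a minimal sketch of a two-cutoff classification. It is not the plug-in's actual code: the function name `classify` is hypothetical, and the 20% ham cutoff is an assumption (only the 90% spam default is stated above).

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a score in [0, 1] to 'ham', 'unsure', or 'spam'.

    ham_cutoff=0.20 is an assumed value for illustration;
    spam_cutoff=0.90 is the default mentioned in this thread.
    """
    if score < ham_cutoff:
        return "ham"
    if score < spam_cutoff:
        return "unsure"
    return "spam"


# A message scoring 83% is unsure at the default threshold,
# but is caught as spam once the threshold is lowered to 80%.
print(classify(0.83))
print(classify(0.83, spam_cutoff=0.80))
```

Anything that used to land between 80% and 90% moves from unsure to spam, which is only safe if no ham ever scores in that band.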
> These are the spam messages with lots of random words
> thrown in to try to defeat the statistical filters.
Have you looked at the clues for any of these? It seems likely (and
people have found) that the random words won't do anything to help move
a message towards ham. A random word is most likely to be unknown to your
filter, so it won't be used, and if it is known, it has about as much
chance of being a spam clue as a ham one. (Unless the words aren't random,
and are tailored to you personally). Looking at the spam clues would tell
you if it is actually random words that are making the difference.
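
A rough sketch of why unknown words should not matter. This is a simplified illustration loosely modelled on chi-squared combining, not the real SpamBayes code: the names `score` and `chi2q`, the constants `UNKNOWN_PROB` and `MIN_STRENGTH`, and the word probabilities are all assumptions made up for the example.

```python
import math

UNKNOWN_PROB = 0.5   # assumed probability for a word the filter has never seen
MIN_STRENGTH = 0.1   # assumed: clues closer than this to 0.5 are ignored


def chi2q(x2, df):
    """Chi-squared survival function, valid for even df only."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def score(words, word_probs):
    """Combine per-word spam probabilities into one score in [0, 1]."""
    probs = [word_probs.get(w, UNKNOWN_PROB) for w in words]
    # Unknown words sit at exactly 0.5, so they are dropped here.
    clues = [p for p in probs if abs(p - 0.5) >= MIN_STRENGTH]
    if not clues:
        return 0.5
    n = len(clues)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in clues), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in clues), 2 * n)
    return (s - h + 1.0) / 2.0


# Hypothetical training data: three spammy words and one hammy word.
word_probs = {"viagra": 0.99, "mortgage": 0.97, "click": 0.92, "meeting": 0.03}
spam = ["viagra", "mortgage", "click"]
padded = spam + ["zymurgy", "quokka", "obelisk"]   # random, never-seen words

print(score(spam, word_probs) == score(padded, word_probs))            # padding changes nothing
print(score(spam + ["meeting"], word_probs) < score(spam, word_probs)) # a known ham word does
```

In this sketch, padding only lowers the score when the added words are already known as ham clues, which is what looking at the clues for a real message would show.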
> Anyway it seems to me that with my ham being recognized
> so perfectly while the spam is less than perfect that
> I would need to classify more spam, further deviating
> from the recommended 1:1 ratio.
> Or do you think the recognition would work better if I
> increased my ham messages (even tho they are all coming
> in with 0%)?
Try both, and see what happens. Most (but not all) of our testing has
shown that an imbalance hurts, although that usually means a big imbalance,
not a 2:1 sort of thing (which might even help). Your mail mix is unique to
you, though, so the only way to know for sure is to try it out.
> In any case, of 690 spam I got in the last two days I
> only have to delete as spam 18 of them. Not bad.
[I presume that these have all been unsures, rather than
An unsure rate of 2.6% is pretty good - this isn't all that different
from the rate gained in lots of the testing. If you can cut even half of
those by lowering the threshold to 80%, then that's probably as good as it's
going to get, without changing the code itself.
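
For reference, the arithmetic behind the 2.6% figure, using the counts quoted above (18 of 690). The helper name `miss_rate` is made up for this sketch.

```python
def miss_rate(missed, total):
    """Fraction of spam that still needed manual attention."""
    return missed / total


total_spam = 690   # spam received over two days, from the quoted message
missed = 18        # messages not classified as spam automatically

print(f"{miss_rate(missed, total_spam):.1%}")      # current rate
print(f"{miss_rate(missed / 2, total_spam):.1%}")  # if half of them were caught
```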
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. That
way, you get everyone's help, and avoid a lack of replies when I'm busy.