[Spambayes] Spam to Ham ratio

Nowhere nowhere at cox.net
Thu Feb 12 01:42:22 EST 2004


Thanks for the input. I will check the next 0%-scored spam I get to see
what the clues actually are. I was unaware of a bug. Does it manifest
as the Outlook plug-in showing a score of 0% while the clues show a
higher score?

As for the threshold, I had already moved it down to 85% and it has
caught almost 100% of the spam. Only an occasional group of unsures
comes through, plus the occasional 0% spam (which is almost always
identical to a spam I have already classified, but with randomly placed
punctuation or words).

And yes, of the 18 "missed spam" all but one were unsure. The last one
was that 0% "bug" we are talking about.

I will look a little closer over the next couple of days, but like I
said, I am VERY pleased with the results. Having to look at only 2% of
the spam is fine by me!

Thanks for helping!

Eric

-----Original Message-----
From: Tony Meyer [mailto:tameyer at ihug.co.nz] 
Sent: Wednesday, February 11, 2004 6:27 PM
To: 'Nowhere'; spambayes at python.org
Subject: RE: [Spambayes] Spam to Ham ratio

> I currently have 139 Good and 286 Spam trained.
> I get about 10x more spam than ham. I find that my
> ham is solidly classified at 0-1% while spambayes
> still misses some spam at numbers like 83% (and some
> at 0%).

Are the ones getting 0% a result of the Outlook plug-in bug that does
that? IOW, if you look at the clues for one of the 0% messages, is it
actually scoring 0%?  (If it is, then that's quite strange.)  If that's
the case, then it seems that a simple solution would be to move your
spam threshold down to 80%, rather than the default 90%.  (This assumes
that you don't ever see any unsures that score above 80%.)
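
To make the effect of moving the threshold concrete, here is a minimal
sketch of three-way classification by cutoff. It is an illustration, not
the plug-in's actual code: the function name, the 20% ham cutoff, and the
handling of scores that fall exactly on a cutoff are all assumptions.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score in [0, 1] to 'ham', 'unsure', or 'spam'.

    The cutoff defaults and the boundary handling are assumptions made
    for this sketch; only the 90% spam default comes from the thread.
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


# A spam scoring 83% is unsure at the default threshold, but is
# caught once the spam threshold is lowered to 80%.
print(classify(0.83))
print(classify(0.83, spam_cutoff=0.80))
```

A message that really scores 0% lands in ham whatever the spam
threshold is, which is why those need to be looked at separately.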

> These are the spam messages with lots of random words
> thrown in to try to defeat the statistical filters.

Have you looked at the clues for any of these?  It seems likely (and
many people have found) that the random words won't do anything to help
move it towards ham.  A random word is most likely to be unknown to your
filter, so won't be used, and if it is known, has about as much chance
of being a spam clue as a ham one.  (Unless the words aren't random, and
are tailored to you personally).  Looking at the spam clues would tell
you if it is actually the random words that are making the difference.
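
The argument above can be sketched in a few lines. This is a simplified
illustration, not SpamBayes's actual scoring code: the function name, the
0.5 probability given to unknown words, and the 0.1 minimum strength are
assumptions chosen for the example.

```python
UNKNOWN_WORD_PROB = 0.5   # assumed: an unseen word is neutral
MIN_STRENGTH = 0.1        # assumed: ignore clues this close to neutral


def usable_clues(tokens, spamprob):
    """Return (token, probability) pairs strong enough to count as clues.

    `spamprob` is a hypothetical dict mapping trained tokens to their
    spam probability; tokens missing from it are treated as unknown.
    """
    clues = []
    for tok in set(tokens):
        p = spamprob.get(tok, UNKNOWN_WORD_PROB)
        if abs(p - 0.5) >= MIN_STRENGTH:
            clues.append((tok, p))
    # Strongest clues first.
    return sorted(clues, key=lambda c: abs(c[1] - 0.5), reverse=True)


trained = {"viagra": 0.99, "meeting": 0.02}
message = ["viagra", "xqzzy", "flibber", "quonk"]
print(usable_clues(message, trained))
```

Under these assumptions the made-up words drop out entirely, and only
the trained word is left to influence the score.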

> Anyway it seems to me that with my ham being recognized
> so perfectly while the spam is less than perfect that
> I would need to classify more spam, further deviating
> from the recommended 1:1 ratio. 
> Or do you think the recognition would work better if I 
> increased my ham messages (even tho they are all coming
> in with 0%)?

Try both, and see what happens.  Most (but not all) of our testing has
shown that an imbalance hurts, although that usually means a big
imbalance, not a 2:1 sort of thing (which might even help).  Your mail
mix is unique to you, though, so the only way to know for sure is to try
it out.
 
> In any case, of 690 spam I got in the last two days I
> only have to delete as spam 18 of them. Not bad.

[I presume that these have all been unsures, rather than
false-negatives.]
An unsure rate of 2.6% is pretty good - this isn't all that different
from the rate gained in lots of the testing.  If you can cut even half
of these by lowering the threshold to 80%, then that's probably as good
as it's going to get, without changing the code itself.
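
For reference, the arithmetic behind those figures, using the counts
quoted above (the helper name is just for illustration):

```python
def unsure_rate(handled_by_hand, total_spam):
    """Fraction of spam that still had to be dealt with manually."""
    return handled_by_hand / total_spam


current = unsure_rate(18, 690)   # 18 of 690 spam needed manual deletion
halved = unsure_rate(9, 690)     # if lowering the threshold cuts that in half

print(f"{current:.1%}")  # 2.6%
print(f"{halved:.1%}")   # 1.3%
```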

=Tony Meyer

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
This way, you get everyone's help, and avoid a lack of replies when I'm
busy.



