[Spambayes] Spam Clues: Missing the obvious spam

Thu Apr 22 19:31:40 EDT 2004

> If I were SpamBayes (:-) I would give it 110% immediately
> just because of the P word. Any comments?

Of course, that's both the weakness and the strength of a statistical
approach.  For any given person, "penis" could be as much of a ham clue as a
spam one.  (*Somebody* must be buying those enhancers, or where does the
money come from? <0.5 wink>).  You'll note that "penis" was a very strong
spam clue for you, but it was balanced by some weak ham clues.

For me it scores 100%, but then I only started training afresh yesterday, so
that might be somewhat meaningless.  (Neither "penis" nor "small" matters to
me...)  Note also that I have bigrams turned on, and slurp_urls (although
that wasn't used).

> # ham trained on: 964
> # spam trained on: 3270
[...]
> '1:14'                              0.213647            8      7

I suspect the imbalance between your ham and spam training is having an
effect.  It's not that high a ratio (roughly 1:3), but "1:14" has been seen
in almost the same number of trained ham and spam messages, and it still
comes out as quite a strong ham clue.
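
To put rough numbers on that, here is a minimal sketch of the per-token
probability calculation (the ratio-scaled estimate with Robinson's
adjustment for sparse data).  It assumes the two count columns in the quoted
line are ham then spam (8 and 7), and that the unknown-word strength and
probability are at their usual defaults of 0.45 and 0.5.  The function name
is made up for illustration.

```python
def token_spamprob(ham_count, spam_count, nham, nspam, s=0.45, x=0.5):
    """Estimate the spam probability of one token.

    s and x are the assumed unknown-word strength and probability.
    """
    # Scale the raw counts by how much of each class has been trained.
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    prob = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the estimate towards x when there is little evidence (small n).
    n = ham_count + spam_count
    return (s * x + n * prob) / (s + n)


# Training totals and token counts from the quoted clue report.
imbalanced = token_spamprob(8, 7, nham=964, nspam=3270)
# Same token counts, but with hypothetical equal training totals.
balanced = token_spamprob(8, 7, nham=964, nspam=964)

print(f"964 ham / 3270 spam: {imbalanced:.6f}")
print(f"964 ham / 964 spam:  {balanced:.6f}")
```

Under those assumptions the first figure comes out at about 0.2136, in line
with the quoted score, while the same counts with equal training totals give
roughly 0.47, which is close to neutral.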

I think that the next major win for SpamBayes might end up being a solution
to the imbalance issue.  It's even possible that this might be some sort of
"train to exhaustion" method, since that implicitly keeps the database
balanced.

In any case, 87% is pretty high.  Would you even get any false positives if
you set your spam threshold to 80%?  Mine is 80.5% (the .5 is there from a
time a while back when I was testing the ability to have fractional
thresholds), and it does well for me.

=Tony Meyer

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.