[Spambayes] tons of false positives after upgrading
tameyer at ihug.co.nz
Mon Jan 10 00:42:45 CET 2005
> i had been using version 0.3 of spambayes for a long time
> (XP/outlook express) and it was working fairly well. i
> recently upgraded to 1.0.1,
Wow - that's quite a jump!
> and now i get a ton of false positives (including the
> confirmation and welcome messages from this mailing list !!)
> probably close to 20% of my valid emails are being marked
> as spam.
With such a large jump, the easiest solution, particularly to take advantage
of the various improvements in SpamBayes over that time, would be to retrain
from scratch. Mistake-based training (cf.
<http://entrian.com/sbwiki/TrainingIdeas>) should result in high accuracy
(certainly higher than you're getting right now) with only a few dozen
> here is the info from the Message Clues page (the "Clues" link)
> from one of the false positives. it is marked as 99.7% spam
> probability! and it appears that it is counting the "spam,"
> string in the 'subject' and 'to' fields as part of the reason
> to consider it spam (?), even though those were added by Spambayes.
The latter is a known bug that will be fixed in 1.1 (it's fixed in CVS).
They are, unfortunately, very strong clues in this example message.
I suspect that one of the reasons for the sudden change is that you were
using the experimental ham/spam imbalance option that SpamBayes used to
include, which has since been removed entirely. Suddenly no longer having
that adjustment could have quite an impact.
> although 0.754605 7 8
This is a concern - you have trained 7 ham messages and 8 spam messages with
"although" in them, and the score is definitely spam. The most probable
cause for this is the training imbalance (1299:3644 or ~1:2.8), although
that doesn't really seem all that bad (maybe those counts are out? Given
the number of database problems that have been fixed since 0.3 there's a
moderate chance that the database is in a shoddy state). Generally a roughly
balanced database is better.
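
To illustrate how the imbalance can push a nearly evenly trained word like
"although" towards spam, here is a rough sketch of the per-word calculation.
It is not the actual SpamBayes source: the function name is made up, and it
assumes the usual Robinson-style adjustment with an unknown-word probability
of 0.5 and strength of 0.45. It also assumes that the 3644 figure is the ham
total and 1299 the spam total, which the clues above don't state; that
reading does happen to reproduce the 0.754605 shown for "although".

```python
def word_spam_prob(ham_count, spam_count, nham, nspam,
                   unknown_prob=0.5, unknown_strength=0.45):
    """Hypothetical helper: estimate a single word's spam probability.

    ham_count / spam_count: trained messages of each kind containing the word.
    nham / nspam: total ham / spam messages trained.
    unknown_prob / unknown_strength: assumed defaults, not read from any config.
    """
    # Use the fraction of each corpus that contains the word, not raw counts,
    # so a smaller corpus gives each of its messages more weight.
    ham_ratio = ham_count / nham
    spam_ratio = spam_count / nspam
    prob = spam_ratio / (ham_ratio + spam_ratio)

    # Pull the estimate towards unknown_prob when there is little evidence.
    n = ham_count + spam_count
    return (unknown_strength * unknown_prob + n * prob) / (unknown_strength + n)


# Assumed reading of the totals: 3644 ham, 1299 spam.
imbalanced = word_spam_prob(7, 8, nham=3644, nspam=1299)

# Same word counts with a balanced (made-up) database of 1000 each.
balanced = word_spam_prob(7, 8, nham=1000, nspam=1000)

print(round(imbalanced, 6), round(balanced, 3))
```

With the same 7 ham / 8 spam counts, the balanced case comes out only
slightly above 0.5, which is why a roughly balanced database tends to give
less surprising clues.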
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.