[Spambayes] tons of false positives after upgrading

Tony Meyer tameyer at ihug.co.nz
Mon Jan 10 00:42:45 CET 2005


> i had been using version 0.3 of spambayes for a long time
> (XP/outlook express) and it was working fairly well.  i 
> recently upgraded to 1.0.1,

Wow - that's quite a jump!

> and now i get a ton of false positives (including the
> confirmation and welcome messages from this mailing list !!)
> probably close to 20% of my valid emails are being marked
> as spam.

With such a large jump, the easiest solution, and the one that takes best
advantage of the various improvements in SpamBayes over that time, is to
retrain from scratch.  Mistake-based training (see
<http://entrian.com/sbwiki/TrainingIdeas>) should result in high accuracy
(certainly higher than you're getting right now) with only a few dozen
messages trained.
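
As a rough illustration of what mistake-based training means, here is a
minimal Python sketch.  It is not SpamBayes code: train_on_mistakes, score
and train are hypothetical names, and the 0.20/0.90 cutoffs are assumed to
match the usual SpamBayes defaults.  The idea is simply to train a message
only when the classifier got it wrong or was unsure about it.

```python
def train_on_mistakes(messages, score, train,
                      ham_cutoff=0.20, spam_cutoff=0.90):
    """Train only on messages the classifier misjudges or is unsure about.

    messages: iterable of (text, is_spam) pairs, in arrival order.
    score:    callable text -> spam probability in [0, 1] (hypothetical).
    train:    callable (text, is_spam) that updates the classifier
              (hypothetical).
    Returns the number of messages that were actually trained.
    """
    trained = 0
    for text, is_spam in messages:
        p = score(text)
        # Correct means: spam scored at or above the spam cutoff,
        # ham scored at or below the ham cutoff.
        correct = p >= spam_cutoff if is_spam else p <= ham_cutoff
        if not correct:  # a mistake, or an "unsure" between the cutoffs
            train(text, is_spam)
            trained += 1
    return trained
```

Because correctly classified messages are skipped, the database stays small
and grows only where the classifier actually needs help.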

> here is the info from the Message Clues page (the "Clues" link)
> from one of the false positives.  it is marked as 99.7% spam
> probability!  and it appears that it is counting the "spam,"
> string in the 'subject' and 'to' fields as part of the reason
> to consider it spam (?), even though those were added by Spambayes.

Counting the "spam," strings that SpamBayes itself added is a known bug that
will be fixed in 1.1 (it's already fixed in CVS).  Unfortunately, those
tokens are very strong clues in this example message.

I suspect that one of the reasons for the sudden change is that you might
have been using the experimental ham/spam imbalance option that SpamBayes
used to include, which is completely gone these days.  Suddenly not using
it could have quite an impact.

> although 0.754605 7 8 

This is a concern - you have trained 7 ham messages and 8 spam messages with
"although" in them, yet the score is definitely spam.  The most probable
cause for this is the training imbalance (1299:3644, or ~1:2.8), although
that doesn't really seem all that bad.  Maybe those counts are out?  Given
the number of database problems that have been fixed since 0.3, there's a
moderate chance that the database is in a shoddy state.  Generally, a
roughly balanced database is better.
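
To show how an imbalance can push an evenly split word towards spam, here
is a small Python sketch of the per-word calculation as I understand it
(Gary Robinson's adjustment of the ratio of ratios).  The function name is
made up for illustration, s=0.45 and x=0.5 are assumed to be the default
"unknown word" strength and probability, and I am assuming that 3644 is
the ham total and 1299 the spam total, since the totals are not labelled.

```python
def word_spam_prob(ham_count, spam_count, n_ham, n_spam, s=0.45, x=0.5):
    """Estimate the spam probability of a single word.

    ham_count / spam_count: trained ham / spam messages containing the word.
    n_ham / n_spam:         total trained ham / spam messages.
    s, x:                   assumed strength and prior for unknown words.
    """
    # Compare how common the word is within each class, not raw counts.
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the estimate towards x when there is little evidence (small n).
    n = ham_count + spam_count
    return (s * x + n * raw) / (s + n)


# Totals as assumed above (3644 ham, 1299 spam), word seen in 7 ham, 8 spam.
print(word_spam_prob(7, 8, n_ham=3644, n_spam=1299))

# The same word counts with a balanced database.
print(word_spam_prob(7, 8, n_ham=2000, n_spam=2000))
```

Under those assumptions the first call gives roughly 0.7546, in line with
the clue shown above, while the balanced case gives roughly 0.53.  If the
totals were the other way around the word would come out hammy instead, so
treat the ham/spam assignment as a guess.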

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.


