[Spambayes] tons of false positives after upgrading
tim.peters at gmail.com
Mon Jan 10 06:35:52 CET 2005
> i had been using version 0.3 of spambayes for a long time (XP/outlook
> express) and it was working fairly well. i recently upgraded to 1.0.1, and
> now i get a ton of false positives (including the confirmation and welcome
> messages from this mailing list !!) probably close to 20% of my valid
> emails are being marked as spam.
> does anyone have any ideas about how to fix this problem? it's worse now
> than if i had no filter, because i have to comb through every spam looking
> for non-spams! please help!
As Tony suggested, retrain from scratch. Some of the stuff in your
data really doesn't make sense. For example,
> Total emails trained: Spam: 1299 Ham: 3644
> header:Subject:1 0.673037 1135 833
> header:From:1 0.675923 1139 847
> header:To:1 0.67644 1139 849
> header:Date:1 0.676889 1138 850
That says, for example, that 3644-1135=2509 of the ham messages you
trained on didn't have a Subject line (reading the last two columns as
the token's ham count and spam count). That's unbelievable -- or you
have very weird ham <wink>. Similarly, about 2,500 of your ham
messages didn't have a To line, From line, or Date line in the
headers. Those are equally incredible. These kinds of header lines
should appear in virtually all email, whether ham or spam, and then
they're judged as neutral. Instead the presence of a Subject line
"looks spammy" to your database, and that's nuts.
This is also incredible:
> sender:no real name:2**0 0.004644 48 0
That says you've trained on no spam at all where the From line didn't
contain a real name -- yet that's very common in spam, and moderately
unusual in ham. You even have ubiquitous words like "the" and "and"
scoring as spammy! Something is seriously messed up with the training
here -- start over.
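
The arithmetic behind the clue lines quoted above can be checked with a
short sketch. It is not taken from the SpamBayes source. It assumes the
two trailing numbers on each clue line are the token's ham count and spam
count, that scores use a Robinson-style adjusted probability, and that the
prior strength is 0.45 and the prior probability is 0.5 (both values are
assumptions, not stated in this message). The name token_spamprob is
hypothetical.

```python
# Assumed parameters (not stated in the message above).
S = 0.45  # weight given to the prior
X = 0.5   # prior spam probability for a token with no evidence


def token_spamprob(ham_count, spam_count, nham, nspam, s=S, x=X):
    """Adjusted spam probability for a token seen at least once in training."""
    ham_ratio = ham_count / nham      # fraction of trained ham containing the token
    spam_ratio = spam_count / nspam   # fraction of trained spam containing the token
    raw = spam_ratio / (ham_ratio + spam_ratio)
    n = ham_count + spam_count
    # Pull the raw estimate toward the prior x; the pull fades as n grows.
    return (s * x + n * raw) / (s + n)


# Totals from "Total emails trained: Spam: 1299 Ham: 3644".
NHAM, NSPAM = 3644, 1299

# token -> (ham_count, spam_count, score reported in the clue list)
clues = {
    "header:Subject:1": (1135, 833, 0.673037),
    "header:From:1": (1139, 847, 0.675923),
    "header:To:1": (1139, 849, 0.67644),
    "header:Date:1": (1138, 850, 0.676889),
    "sender:no real name:2**0": (48, 0, 0.004644),
}

computed = {}
for token, (ham, spam, reported) in clues.items():
    computed[token] = token_spamprob(ham, spam, NHAM, NSPAM)
    print(f"{token:26s} reported={reported:.6f} "
          f"computed={computed[token]:.6f} ham without token={NHAM - ham}")

# A header line present in every trained message, ham and spam alike,
# should come out neutral.
neutral = token_spamprob(NHAM, NSPAM, NHAM, NSPAM)
print(f"token present in every message: {neutral:.6f}")
```

Under these assumptions the computed scores should agree closely with the
reported ones, which supports reading the columns as ham count then spam
count. A header present in every trained message scores 0.5, so the 0.67
scores come entirely from the roughly 2,500 ham messages recorded without
those headers.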