[Spambayes] tons of false positives after upgrading

Tony Meyer tameyer at ihug.co.nz
Wed Jan 12 00:36:41 CET 2005


> OK, I went to re-train from scratch. I removed hammie.db,
> message_info_database.db, and statistics_database.db from 
> Documents and Settings/Owner/Application Data/SpamBayes/Proxy.
> I was going along fine all day, training new messages, then I
> went to review some additional messages (by right clicking on
> the tray icon). It pulled up a ton more messages than
> I was expecting, so I discarded all except 8 of them.  Then I 
> went to the home page and it says:
> Database only has 7 good and 1 spam - you should consider performing
> additional training.
>
> Apparently the reason it pulled up a ton of messages, was 
> because all of a sudden it decided that it hadn't trained on
> them already, even though it had. 

Were they definitely the same messages?  That wouldn't fit with my
suspicions (below).

> So the question is, what
> did I do before the error occurred, that might have caused
> spambayes to suddenly not remember any previous training.
> The answer -- the only thing i did was to modify the 
> configuration, so it would put the string "spam," in the "To:"
> and "Subject:" headers.
> 
> So is modifying the configuration supposed to undo all the 
> prior training?

No.

> If not, any guesses on why this happened?

SpamBayes 1.0.1 has a known bug (already fixed in CVS): if you make any
configuration changes via the web interface, many options are reset to
their default values until the next time you restart SpamBayes.  Two
effects have been reported so far.  In the first, the columns on the review
page reset to just "Subject" and "From".  In the second, the cache
directories reset, so if you are using non-defaults for those, mail is put
in the wrong place and the review page presents mail from there instead of
from the correct place.

If this does turn out to be another case of this bug, then maybe we should
move 1.0.2 up a bit to next week, since it's starting to crop up fairly
often, and putting the release out earlier would be less work than
continually identifying this bug.

> Even though the home page says:
>  Total emails trained: Spam: 1 Ham: 7
>  Database only has 7 good and 1 spam - you should consider 
> performing additional training.
> 
> If you click on the "more statistics" link, it says:
>  SpamBayes has processed 124 messages - 27 (22%) good, 54 
> (44%) spam and 35
> (28%) unsure.
> 30 messages were manually classified as good (1 was a false 
> positive). 32 messages were manually classified as spam (4 
> were false negatives). 9 unsure messages were manually 
> identified as good, and 14 as spam.
> 
> so apparently something is corrupted again?  any ideas?

Hmm.  The "more statistics" page uses the 'message_info_database.db'
database rather than hammie.db, which holds the token counts themselves.
It sounds like the same message_info database has been used all along, but
the token database has changed part-way through.

[Later]
Thinking about this more, I think this is a case of the bug described above.
I'm guessing that your configuration file has 'statistics_database.db' set
as the token database (a hangover from a poor choice of mine in a version a
while back).  When you change the configuration via the web interface, that
option resets to its default along with the others, so you change the token
database that's being used.
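
The suspected mechanism can be sketched with a toy model in Python.  This is
not SpamBayes code: the class, its methods, and the counts are all
hypothetical, and it only assumes what is described above (a token database
whose file name falls back to the default after a web configuration change,
and a message-info store that never changes).

```python
# Toy model only: none of these names come from SpamBayes itself.
DEFAULTS = {"token_db": "hammie.db"}


class ToyProxy:
    def __init__(self, configured):
        self.configured = dict(configured)  # values from the config file
        self.active = dict(configured)      # values the running process uses
        self.token_dbs = {}                 # file name -> [ham, spam] counts
        self.message_info = 0               # one store, always the same file

    def train(self, is_spam):
        counts = self.token_dbs.setdefault(self.active["token_db"], [0, 0])
        counts[1 if is_spam else 0] += 1
        self.message_info += 1

    def change_config_via_web(self):
        # The modelled bug: other options fall back to their defaults.
        self.active = dict(DEFAULTS)

    def restart(self):
        # A restart re-reads the configured values.
        self.active = dict(self.configured)

    def home_page_counts(self):
        # (ham, spam) trained in whichever token database is active now.
        return tuple(self.token_dbs.get(self.active["token_db"], [0, 0]))


if __name__ == "__main__":
    proxy = ToyProxy({"token_db": "statistics_database.db"})
    for _ in range(20):
        proxy.train(is_spam=False)
    for _ in range(10):
        proxy.train(is_spam=True)

    proxy.change_config_via_web()       # e.g. editing the header options
    for i in range(8):
        proxy.train(is_spam=(i == 0))

    print(proxy.home_page_counts())     # (7, 1): only the new database
    print(proxy.message_info)           # 38: every message ever handled
    proxy.restart()
    print(proxy.home_page_counts())     # (20, 10): the original database
```

In this model the home page and the statistics page disagree for the same
reason as in the report: one reads the active token database, the other
reads a store that was never switched.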

There are two workarounds for this (until 1.0.2 is out, which has the bug
fix):

  1.  Change the configuration to use "hammie.db" as the "Storage file
      name".  Since that's the default, the reset-to-default bug won't
      have any effect on this.

  2.  Always stop and restart SpamBayes after making any changes on the
      configuration page.
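
To see whether a setup is exposed to this, one option is to check which
token database the configuration file names.  The sketch below is only an
illustration: the section name "Storage" and the option name
"persistent_storage_file" are assumptions about how the "Storage file name"
setting appears in the configuration file, and the helper functions are
hypothetical, so check them against your own file before relying on this.

```python
import configparser

DEFAULT_TOKEN_DB = "hammie.db"  # the default named in workaround 1


def configured_token_db(ini_text,
                        section="Storage",                  # assumed name
                        option="persistent_storage_file"):  # assumed name
    """Return the configured token database, or the default if unset."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser.get(section, option, fallback=DEFAULT_TOKEN_DB)


def exposed_to_reset_bug(ini_text):
    """True if a reset to defaults would switch the token database."""
    return configured_token_db(ini_text) != DEFAULT_TOKEN_DB


if __name__ == "__main__":
    sample = "[Storage]\npersistent_storage_file: statistics_database.db\n"
    print(configured_token_db(sample))
    print(exposed_to_reset_bug(sample))
```

If the check reports a non-default name, either workaround above applies:
switch the setting to "hammie.db", or restart after every configuration
change.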

=Tony.Meyer

-- 
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.


