[Spambayes] lots of unsures, heavily biased towards spam

skip at pobox.com
Sun Feb 4 20:21:47 CET 2007


    >> If the interface you're using allows you to delete trained mails you
    >> could also try deleting a bunch of old mails you classified as spam.

    Dave> It does, but I have to confess I don't really understand the
    Dave> implications of doing so.

I think most people agree that the nature of spam changes over time.  New
hosts are compromised, new spam techniques are developed, etc.  If you have
a database of 1000 spams and only 100 hams, it seems likely to me that the
later spams are more important as examples of the types of spams you're
likely to receive in the next few days.  Accordingly, when I find my
ham:spam ratio getting a bit out-of-whack, I generally throw out a few old
spams.
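
As a rough illustration of the pruning idea above, here is a minimal sketch. It assumes the trained spams are available as a plain list ordered oldest first; the function name and the max_spam_per_ham parameter are hypothetical and are not part of SpamBayes. Any message dropped this way would still need to be untrained through whatever interface you use.

```python
def prune_old_spam(spams, ham_count, max_spam_per_ham):
    """Return the newest spams to keep (hypothetical helper, not a SpamBayes API).

    spams is assumed to be ordered oldest first, so the oldest
    entries are the ones dropped when the list is too long.
    """
    limit = ham_count * max_spam_per_ham
    if len(spams) <= limit:
        return list(spams)
    # Keep only the last `limit` entries, i.e. the most recent spams.
    return list(spams[len(spams) - limit:])


# With 1000 spams and 100 hams, a 2:1 cap keeps the 200 newest spams.
kept = prune_old_spam(list(range(1000)), ham_count=100, max_spam_per_ham=2)
```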

I know this won't help you with the imap filter, however...  I use the
train-to-exhaustion script in the contrib directory, which helps keep my
ham:spam ratio tractable.  I have it train with a fixed ratio (right now, 2
spams to 1 ham) and have it train from newest to oldest messages.  Given a
pair of spam and ham mailboxes, it reverses them, then trains using 2
spam, 1 ham, 2 spam, 1 ham, ... until one mailbox is exhausted.  It ignores
any remaining messages in the other mailbox.  The cycle repeats for any
messages which weren't correctly scored on the last pass.  Once a message
scores correctly, it isn't considered again.  If a message scores correctly
the first time, it's tossed out altogether.
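
The loop described above can be sketched roughly as follows. This is not the contrib script itself, just a simplified illustration: the train and score callables, the function names, the max_rounds limit and the 0.9/0.2 cutoffs are all assumptions made for the example, and mailboxes are assumed to be plain lists ordered oldest first.

```python
_DONE = object()  # sentinel marking an exhausted mailbox


def interleave(spams, hams, ratio=2):
    """Yield (msg, is_spam) newest first: `ratio` spams, then 1 ham, repeated.

    Stops as soon as either mailbox runs out; whatever is left in the
    other mailbox is ignored.
    """
    spam_iter = iter(reversed(spams))
    ham_iter = iter(reversed(hams))
    while True:
        for _ in range(ratio):
            msg = next(spam_iter, _DONE)
            if msg is _DONE:
                return
            yield msg, True
        msg = next(ham_iter, _DONE)
        if msg is _DONE:
            return
        yield msg, False


def train_to_exhaustion(spams, hams, train, score, ratio=2,
                        spam_cutoff=0.9, ham_cutoff=0.2, max_rounds=10):
    """Train only on messages that score wrongly, repeating on the misses.

    train(msg, is_spam) and score(msg) are hypothetical callables
    standing in for a real classifier. Returns (rounds, still_missed).
    """
    pending = list(interleave(spams, hams, ratio))
    rounds = 0
    while pending and rounds < max_rounds:
        missed = []
        for msg, is_spam in pending:
            s = score(msg)
            correct = s >= spam_cutoff if is_spam else s <= ham_cutoff
            if not correct:
                # Misscored: train on it and look at it again next round.
                train(msg, is_spam)
                missed.append((msg, is_spam))
            # Correctly scored messages are dropped and never revisited.
        pending = missed
        rounds += 1
    return rounds, pending
```

Note that this sketch trains a message again on every round in which it is missed, which a real implementation may well handle differently.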

Skip




