[Spambayes] Another hammie setup

T. Alexander Popiel popiel at wolfskeep.com
Sun Dec 15 21:10:02 EST 2002


A couple of weeks ago, I mentioned that I was finally going to start
using hammie for my live filtering, and that I'd share the scripts
and other files I wrote to do so.

First off, let me describe how I've got things set up.  I am an
avid (and rather religious) MH user, so my mail folders are of
course stored in the MH format (directories full of single-message
files, where the filenames are numbers indicating ordering in the
folder).  I've got four mail folders of interest for this discussion:
everything, spam, newspam, and inbox.

When mail arrives, it is classified, then immediately copied into the
everything folder.  If it was classified as spam or ham, it is
trained as such, reinforcing the classification.  Then, if it was
labeled as spam, it goes into the newspam folder; otherwise it
goes into my inbox.
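
The delivery rules above can be summarized in a few lines of Python.
This is only a sketch of the decision logic as described in this
message, not the linked scripts themselves; the function name route
and the label strings are hypothetical.

```python
def route(label):
    """Map a classification label to (training label, destination folder).

    label is assumed to be one of "spam", "ham" or "unsure".
    Every message is also copied to the everything folder, which this
    sketch leaves out because it does not depend on the label.
    """
    # Only confident classifications are trained on at delivery time.
    train_as = label if label in ("spam", "ham") else None
    # Spam goes to newspam; ham and unsure both land in the inbox.
    destination = "newspam" if label == "spam" else "inbox"
    return train_as, destination
```

Under these assumptions an unsure message is delivered to the inbox
without being trained, so it only enters the training data at the
next nightly retraining.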

When I read my mail (from inbox or newspam), I move any confirmed
spam into my spam folder; ham may be deleted.  (Of course, I still
have a copy of my ham in the everything folder.)

Every night, I run a complete retraining (from cron at 2:10am);
it trains on all mail in the everything folder that is less than
4 months old.  If a given message has an identical copy in the spam
or newspam folder, then it is trained as spam; otherwise it is
trained as ham.  This does mean that unread unsures will be
treated as ham for up to a day; there are few enough of them that
I don't care.  The four-month age limit has the effect of
expiring old mail out of the training set, which keeps the
database size fairly manageable (it's currently just under 10 MB,
with 6 days to go until I have 4 months of data).
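
The labeling rule for the nightly retraining could be sketched as
follows.  This is an illustration under stated assumptions, not the
linked scripts: it treats "identical copy" as byte-for-byte equality
(compared by SHA-1 digest), judges age by file modification time, and
uses 120 days for four months.  All function names are hypothetical.

```python
import hashlib
import os
import time

# Roughly four months; the exact cutoff is an assumption.
MAX_AGE_SECONDS = 120 * 24 * 3600


def digest(path):
    """Return a digest of the raw bytes of one message file."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read()).hexdigest()


def message_files(folder):
    """Yield message paths in an MH folder (files with numeric names)."""
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if name.isdigit() and os.path.isfile(path):
            yield path


def label_messages(everything, spam_folders, now=None):
    """Yield (path, label) for each recent message in everything."""
    now = time.time() if now is None else now
    # Collect digests of everything currently filed as spam.
    spam_digests = set()
    for folder in spam_folders:
        for path in message_files(folder):
            spam_digests.add(digest(path))
    # Walk the everything folder in MH message order.
    ordered = sorted(message_files(everything),
                     key=lambda p: int(os.path.basename(p)))
    for path in ordered:
        if now - os.path.getmtime(path) > MAX_AGE_SECONDS:
            continue  # too old: expired from the training set
        label = "spam" if digest(path) in spam_digests else "ham"
        yield path, label
```

Each (path, label) pair would then be handed to the trainer.  Note
that a message with no copy in spam or newspam is labeled ham, which
is why unread unsures count as ham until they are filed.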

The retraining generates a little report for me each night,
showing a graph of my ham and spam levels over time.  Here's
a sample:

Scanning spamdir (/home/cashew/popiel/Mail/spam):
Scanning spamdir (/home/cashew/popiel/Mail/newspam):
Scanning everything
sshsshsshsshsshsshsshshsshshshshsshshshshshshsshsshshsshssshsshshsshshsshshsssh
shshshsshshsshshshshshssshshshsshsshsshshshshshshsshshhshshsshshshshssshssshshs
ssshs
  154
  152|
  144|
  136|
  128|                                                   h
  120|                                                   h      s
  112|                             s       ss     ss s   h   s  ss
  104|                             ss      ss     ss sHs h   s  ss
   96|                           s ss   s  sH  s  ss sHs h  Sss ss
   88|                    h  ss  s sss ss  sH sss ssssHHhS sSsssss
   80|                 s sSH ss ssssss sssssH HssssHsHHHSS sSsssss
   72|                 ssHSH ssssssssssssHHsHSHssHsHsHHHSSssSsssss
   64|      s  s  s s sHsHSHsssssssHsHsssHHsHSHssHsHsHHHSSssSsssss
   56|   s sss ss sssssHHHSHsHsssHsHHHHssHHsHSHHsHHHsHHHSSsHSsssss
   48|   ssssssssssssssHHHSHHHHssHsHHHHHsHHsHSHHsHHHsHHHSSsHSssHsss
   40|   ssssssssssHsHHHHHSHHHHHsHsHHHHHHHHHHSHHsHHHHHHHSSsHSHsHHss
   32|   ssHHssHsssHHHHHHHSHHHHHHHsHHHHHHHHHHSHHsHHHHHHHSSHHSHHHHHs
   24|   ssHHHHHHHsHHHHHHHSHHHHHHHsHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHs
   16|   HsHHHHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHs
    8|   HHHHHHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHH
    0|SSSUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
     +------------------------------------------------------------

Total: 6441 ham, 9987 spam (60.79% spam)

real    7m45.049s
user    5m38.980s
sys     0m39.170s

This is a set of overlaid bar graphs; s is for spam, h is for ham,
and u is for unsure.  Where bars overlap, the shorter bars are drawn
in front and capitalized.  In the example, there are very few days
on which I received more ham than spam.
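
One way to draw such an overlaid graph is sketched below.  It is a
guess at the drawing rule based only on the description above, not
the code that produced the sample: in each column the shortest bar
that reaches a row is drawn in front, and a bar is capitalized unless
it is the tallest in its column.  The names cell and render, the
input layout, and the tie-breaking are assumptions.

```python
def cell(counts, level):
    """Return the character for one column at one row of the graph.

    counts maps a series letter ("s", "h", "u") to that day's count;
    level is the lower bound of the row being drawn.
    """
    visible = [(n, key) for key, n in counts.items() if n > level]
    if not visible:
        return " "
    n, key = min(visible)  # the shortest bar reaching this row is in front
    tallest = max(counts.values())
    # Bars shorter than the tallest in the column are capitalized.
    return key.upper() if n < tallest else key


def render(days, step=8):
    """Return the graph as a list of text lines, top row first."""
    top = max(max(d.values()) for d in days)
    lines = []
    for level in range((top // step) * step, -1, -step):
        row = "".join(cell(d, level) for d in days)
        lines.append("%5d|%s" % (level, row.rstrip()))
    lines.append(" " * 5 + "+" + "-" * len(days))
    return lines
```

Calling print("\n".join(render(days))) with one dict per day, such as
{"s": 20, "h": 5, "u": 0}, prints one column per day with rows
labeled in steps of 8.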

My scripts (and a .procmailrc) are available at:
  http://www.wolfskeep.com/~popiel/spambayes/hammie

- Alex


