[Spambayes] Another hammie setup
T. Alexander Popiel
popiel at wolfskeep.com
Sun Dec 15 21:10:02 EST 2002
A couple of weeks ago, I mentioned that I was finally going to start
using hammie for my live filtering, and that I'd share the scripts,
etc. that I generated to do so.
First off, let me describe how I've got things set up. I am an
avid (and rather religious) MH user, so my mail folders are of
course stored in the MH format (directories full of single-message
files, where the filenames are numbers indicating ordering in the
folder). I've got four mail folders of interest for this discussion:
everything, spam, newspam, and inbox.
When mail arrives, it is classified, then immediately copied into the
everything folder. If it was classified as spam or ham, it is
trained as such, reinforcing the classification. Then, if it was
labeled as spam, it goes into the newspam folder; otherwise it
goes into my inbox.
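
As a rough illustration of that delivery flow (this is a sketch, not the
actual scripts linked below), the decision can be written as a small pure
function. The name plan_delivery and the label strings "spam", "ham" and
"unsure" are assumptions made for the example; the real classification
comes from hammie.

```python
def plan_delivery(label):
    """Return (folders, train_as) for one freshly classified message.

    label is assumed to be "spam", "ham" or "unsure" (hypothetical names).
    folders lists where the message is stored; train_as is the class to
    train on immediately, or None when the classifier was unsure.
    """
    if label not in ("spam", "ham", "unsure"):
        raise ValueError("unexpected label: %r" % (label,))
    # Every message is archived in "everything" first.
    folders = ["everything"]
    # Only confident classifications are reinforced by training.
    train_as = label if label in ("spam", "ham") else None
    # Spam goes to "newspam"; ham and unsure both land in the inbox.
    folders.append("newspam" if label == "spam" else "inbox")
    return folders, train_as
```

Note that an unsure message is delivered to the inbox but not trained
until the nightly run.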
When I read my mail (from inbox or newspam), I move any confirmed
spam into my spam folder; ham may be deleted. (Of course, I still
have a copy of my ham in the everything folder.)
Every night, I run a complete retraining (from cron at 2:10am);
it trains on all mail in the everything folder that is less than
4 months old. If a given message has an identical copy in the spam
or newspam folder, then it is trained as spam; otherwise it is
trained as ham. This does mean that unread unsures will be
treated as ham for up to a day; there are few enough of them that
I don't care. The four-month age limit will have the effect of
expiring old mail out of the training set, which will keep the
database size fairly manageable (it's currently just under 10 meg,
with 6 days to go until I have 4 months of data).
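
The labelling step of the nightly run might look roughly like the sketch
below. It is not the actual script: the function names are made up for
the example, "identical copy" is assumed to mean byte-for-byte identical
file contents (compared by SHA-256 digest), message age is assumed to be
the file's modification time, and "4 months" is approximated as 120 days.
The training itself is left out.

```python
import hashlib
import os
import time

# Assumption: "4 months" is approximated as 120 days.
MAX_AGE_SECONDS = 120 * 24 * 3600


def message_files(folder):
    """Yield paths of MH messages (all-digit filenames) in numeric order."""
    for name in sorted(os.listdir(folder), key=lambda n: (len(n), n)):
        path = os.path.join(folder, name)
        if name.isdigit() and os.path.isfile(path):
            yield path


def digest(path):
    """Return a digest of the raw file contents, used to spot identical copies."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def label_for_retraining(everything, spam_folders, now=None):
    """Return [(path, "spam" or "ham")] for recent messages in everything."""
    now = time.time() if now is None else now
    spam_digests = set()
    for folder in spam_folders:
        for path in message_files(folder):
            spam_digests.add(digest(path))
    labeled = []
    for path in message_files(everything):
        # Assumption: age is taken from the file's mtime.
        if now - os.path.getmtime(path) >= MAX_AGE_SECONDS:
            continue  # expired out of the training set
        label = "spam" if digest(path) in spam_digests else "ham"
        labeled.append((path, label))
    return labeled
```

Calling label_for_retraining("Mail/everything", ["Mail/spam",
"Mail/newspam"]) would give the list to feed to the trainer; anything
without a copy in either spam folder, unread unsures included, comes
back as ham.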
The retraining generates a little report for me each night,
showing a graph of my ham and spam levels over time. Here's
a sample:
Scanning spamdir (/home/cashew/popiel/Mail/spam):
Scanning spamdir (/home/cashew/popiel/Mail/newspam):
Scanning everything
sshsshsshsshsshsshsshshsshshshshsshshshshshshsshsshshsshssshsshshsshshsshshsssh
shshshsshshsshshshshshssshshshsshsshsshshshshshshsshshhshshsshshshshssshssshshs
ssshs
154
152|
144|
136|
128| h
120| h s
112| s ss ss s h s ss
104| ss ss ss sHs h s ss
96| s ss s sH s ss sHs h Sss ss
88| h ss s sss ss sH sss ssssHHhS sSsssss
80| s sSH ss ssssss sssssH HssssHsHHHSS sSsssss
72| ssHSH ssssssssssssHHsHSHssHsHsHHHSSssSsssss
64| s s s s sHsHSHsssssssHsHsssHHsHSHssHsHsHHHSSssSsssss
56| s sss ss sssssHHHSHsHsssHsHHHHssHHsHSHHsHHHsHHHSSsHSsssss
48| ssssssssssssssHHHSHHHHssHsHHHHHsHHsHSHHsHHHsHHHSSsHSssHsss
40| ssssssssssHsHHHHHSHHHHHsHsHHHHHHHHHHSHHsHHHHHHHSSsHSHsHHss
32| ssHHssHsssHHHHHHHSHHHHHHHsHHHHHHHHHHSHHsHHHHHHHSSHHSHHHHHs
24| ssHHHHHHHsHHHHHHHSHHHHHHHsHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHs
16| HsHHHHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHs
8| HHHHHHHHHHHHHHHHHSHHHHHHHHHHHHHHHHHHSHHHHHHHHHHSSHHSHHHHHH
0|SSSUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU
+------------------------------------------------------------
Total: 6441 ham, 9987 spam (60.79% spam)
real 7m45.049s
user 5m38.980s
sys 0m39.170s
This is a set of overlaid bar graphs; s is for spam, h is for ham,
u is unsure. The shorter bars are in front and capitalized. In
the example, I have very few days where I have more ham than spam.
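
For anyone wanting to draw a similar graph, here is a minimal sketch of
the overlay rule. It is not the code from the scripts below; the function
name, the input format (one dict of counts per day) and the rule that a
bar reaches a row when its count exceeds the row's label are all
assumptions made for the example.

```python
def render_graph(days, step=8):
    """Render overlaid bars as text rows, top row first.

    days is a list of dicts such as {"s": 20, "h": 5, "u": 1}, one per
    column. In each cell the shortest bar reaching that row is drawn in
    front; it is capitalized when a taller bar sits behind it.
    """
    top = max((max(day.values()) for day in days), default=0)
    rows = []
    for level in reversed(range(0, top, step)):
        cells = []
        for day in days:
            # Bars tall enough to reach this row (assumed rule: count > level).
            reaching = [(count, kind) for kind, count in day.items()
                        if count > level]
            if not reaching:
                cells.append(" ")
                continue
            _, kind = min(reaching)  # the shortest bar is in front
            cells.append(kind.upper() if len(reaching) > 1 else kind.lower())
        rows.append(("%4d|" % level) + "".join(cells).rstrip())
    rows.append("    +" + "-" * len(days))
    return rows


if __name__ == "__main__":
    sample = [{"s": 20, "h": 5, "u": 1}, {"s": 3, "h": 12, "u": 0}]
    print("\n".join(render_graph(sample)))
```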
My scripts (and a .procmailrc) are available at:
http://www.wolfskeep.com/~popiel/spambayes/hammie
- Alex