[Spambayes] Re: Have you guys considered...
Skip Montanaro
skip at pobox.com
Fri Aug 8 16:26:58 EDT 2003
Derrick> On Fri, Aug 08, 2003 at 11:12:31AM -0700, Reale, Tom wrote:
Derrick> | You have built a product that could autofile e-mail into all
Derrick> | kinds of private categories for the user.
Derrick> This is a separate issue entirely. Spambayes is designed to
Derrick> identify spam, but it doesn't do anything with it. Sorting
Derrick> mail into separate folders is best done with the proper LDA,
Derrick> such as procmail, maildrop, sieve, or (depending on your mail
Derrick> reader) filter rules in the mail reader itself. I'd rather see
Derrick> spambayes kept as simple as possible by focusing on a single
Derrick> problem/task.
It's really the same problem, just an n-way classification instead of a
2-way classification. I use procmail to sort my mail into folders. My
procmailrc file looks like this:
    score the message with hammiefilter
    if it scores as spam, file as spam
    if it scores as unsure, file as unsure
    ### normal procmail filters from this point
    if message sent to "foo" list, file in "foo" mailbox
    if message sent to "bar" list, file in "bar" mailbox
    ...
    file in mbox
While not nearly as annoying as spam, it's still a mild bother that if
someone bcc's a message to the foo list, procmail will file it in my mbox
instead of my foo mailbox.
Tools like ifile and CRM114 can classify mail into N bins instead of just
two. I think SpamBayes could be used as-is to solve this problem. Suppose
I have four categories of mail: spam, python, perl, and ruby. I could train
like this:
    train spam as "spam", and python+perl+ruby as "ham" - write to db spam.db
    train perl+ruby as "spam", and python as "ham" - write to db python.db
    train python+ruby as "spam", and perl as "ham" - write to db perl.db
    train python+perl as "spam", and ruby as "ham" - write to db ruby.db
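The training plan above follows a mechanical pattern, so it can be generated for any list of topics. The following is only a sketch of that bookkeeping in plain Python: the function name one_vs_rest_plan and the dictionary layout are made up for illustration and are not part of SpamBayes. It does no training itself; it only works out which categories would be fed in as "ham" and which as "spam" for each database.

```python
def one_vs_rest_plan(topics, spam_label="spam"):
    """Return {db_name: {"ham": [...], "spam": [...]}} for the scheme above.

    Hypothetical helper, not a SpamBayes API. The spam database treats every
    topic as ham; each topic database treats its own topic as ham and the
    remaining topics as "spam".
    """
    plan = {
        f"{spam_label}.db": {"ham": list(topics), "spam": [spam_label]},
    }
    for topic in topics:
        others = [t for t in topics if t != topic]
        plan[f"{topic}.db"] = {"ham": [topic], "spam": others}
    return plan


plan = one_vs_rest_plan(["python", "perl", "ruby"])
for db_name, split in plan.items():
    print(db_name, "ham:", "+".join(split["ham"]), "spam:", "+".join(split["spam"]))
```

Note that, as in the listing above, real spam is not trained into the per-topic databases at all; those only ever see a message after the spam database has let it through.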
In procmail, you'd run hammie multiple times, then file based upon the
result:
    score the message with hammiefilter -d spam.db
    if it scores as spam, file as spam
    score the message with hammiefilter -d python.db
    if it scores as ham, file as python
    score the message with hammiefilter -d perl.db
    if it scores as ham, file as perl
    score the message with hammiefilter -d ruby.db
    if it scores as ham, file as ruby
    ...
    file in mbox (or unsure)
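The same cascade can be sketched in Python to make the control flow explicit. This is an illustration under stated assumptions, not SpamBayes code: each "scorer" is a hypothetical callable that returns a spam probability between 0 and 1 for one database, the two cutoff values are illustrative, and the keyword-based stubs exist only so the example runs without any trained databases.

```python
# Illustrative cutoffs (assumptions): at or below HAM_CUTOFF counts as "ham",
# at or above SPAM_CUTOFF counts as "spam", anything in between is unsure.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def file_message(msg, spam_scorer, topic_scorers,
                 ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Return the folder name for msg, trying the databases in order.

    spam_scorer and each scorer in topic_scorers are hypothetical callables
    standing in for "score the message against one database".
    """
    if spam_scorer(msg) >= spam_cutoff:
        return "spam"
    for folder, scorer in topic_scorers:
        # In a per-topic database, "ham" means "belongs to this topic".
        if scorer(msg) <= ham_cutoff:
            return folder
    return "mbox"  # nothing claimed the message


def keyword_stub(word):
    """Stand-in for a trained topic database: low score if word is present."""
    return lambda msg: 0.05 if word in msg.lower() else 0.95


def spam_stub(msg):
    """Stand-in for the spam database."""
    return 0.99 if "viagra" in msg.lower() else 0.01


topic_scorers = [(name, keyword_stub(name)) for name in ("python", "perl", "ruby")]

for text in ("Buy viagra now", "Python list comprehension question", "Lunch tomorrow?"):
    print(repr(text), "->", file_message(text, spam_stub, topic_scorers))
```

The order of topic_scorers matters: the first database that claims a message wins, just as the first matching recipe does in the procmail version.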
Of course, classification gets more compute-intensive(*) and the classifiers
are harder to train. I suspect it would still be fairly good at classifying
your mail, though.
Skip
(*) If such a scheme allowed you to dispense with procmail entirely, it
might not cost much, however. A single Python process scoring against
multiple SpamBayes databases would probably be more efficient than procmail
forking off a bunch of hammiefilter processes. Of course, to replace
procmail you'd have to be as careful about not losing mail as procmail is.
S