[Spambayes] Re: Have you guys considered...

Skip Montanaro skip at pobox.com
Fri Aug 8 16:26:58 EDT 2003


    Derrick> On Fri, Aug 08, 2003 at 11:12:31AM -0700, Reale, Tom wrote:
    Derrick> | You have built a product that could autofile e-mail into all
    Derrick> | kinds of private categories for the user.

    Derrick> This is a separate issue entirely.  Spambayes is designed to
    Derrick> identify spam, but it doesn't do anything with it.  Sorting
    Derrick> mail into separate folders is best done with the proper LDA,
    Derrick> such as procmail, maildrop, sieve, or (depending on your mail
    Derrick> reader) filter rules in the mail reader itself.  I'd rather see
    Derrick> spambayes kept as simple as possible by focusing on a single
    Derrick> problem/task.

It's really the same problem, just an n-way classification instead of a
2-way classification.  I use procmail to sort my mail into folders.  My
procmailrc file looks like this:

    score the message with hammiefilter
    if it scores as spam, file as spam
    if it scores as unsure, file as unsure
    ### normal procmail filters from this point
    if message sent to "foo" list, file in "foo" mailbox
    if message sent to "bar" list, file in "bar" mailbox
    ...
    file in mbox

While not nearly as annoying as spam, it's still a mild bother that if
someone bcc's a message to the foo list, procmail will file it in my mbox
instead of my foo mailbox.

Tools like ifile and CRM114 can classify mail into N bins instead of just
two.  I think SpamBayes could be used as-is to solve this problem.  Suppose
I have four categories of mail: spam, python, perl, and ruby.  I could train
like this:

    train spam as "spam", and python+perl+ruby as "ham" - write to db spam.db
    train perl+ruby as "spam", and python as "ham" - write to db python.db
    train python+ruby as "spam", and perl as "ham" - write to db perl.db
    train python+perl as "spam", and ruby as "ham" - write to db ruby.db
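
The pattern in that list is mechanical, so it can be generated for any set of
categories.  Here is a rough sketch in Python; the function name
training_plan and the dictionary layout are made up for illustration and are
not part of SpamBayes:

```python
def training_plan(topics, junk="spam"):
    """Map each database name to the categories trained as "spam" and "ham".

    Hypothetical helper: it only describes which mail goes where,
    it does not do any training itself.
    """
    # The junk database is the usual 2-way split: real spam vs. everything else.
    plan = {f"{junk}.db": {"spam": [junk], "ham": list(topics)}}
    for topic in topics:
        # For a topic database, the topic itself plays the role of "ham"
        # and all the other topics play the role of "spam".
        others = [t for t in topics if t != topic]
        plan[f"{topic}.db"] = {"spam": others, "ham": [topic]}
    return plan


plan = training_plan(["python", "perl", "ruby"])
for db_name, labels in plan.items():
    print(db_name, labels)
```

Note that, as in the list above, real spam is not trained into the per-topic
databases; it is only caught by spam.db.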

In procmail, you'd run hammiefilter multiple times, then file based upon the
result:

    score the message with hammiefilter -d spam.db
    if it scores as spam, file as spam
    score the message with hammiefilter -d python.db
    if it scores as ham, file as python
    score the message with hammiefilter -d perl.db
    if it scores as ham, file as perl
    score the message with hammiefilter -d ruby.db
    if it scores as ham, file as ruby
    ...
    file in mbox (or unsure)
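
The same cascade can be written as a small Python function, which makes the
order of the checks explicit.  This is only a sketch: file_message and the
score callable are hypothetical names, and the canned verdicts stand in for
real hammiefilter runs.

```python
def file_message(score, topics=("python", "perl", "ruby")):
    """Pick a mailbox using the cascade above.

    `score` is a hypothetical callable: given a database name it returns
    "spam", "ham", or "unsure" for the current message.
    """
    if score("spam.db") == "spam":
        return "spam"
    for topic in topics:
        # Databases are consulted one at a time; the first "ham" verdict wins.
        if score(f"{topic}.db") == "ham":
            return topic
    return "mbox"


# Canned verdicts standing in for real scoring runs.
verdicts = {"spam.db": "ham", "python.db": "spam",
            "perl.db": "ham", "ruby.db": "spam"}
folder = file_message(verdicts.get)
print(folder)
```

Because the first "ham" verdict wins, a message that looks like ham to two
topic databases is filed under whichever topic is checked first.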

Of course, classification gets more compute-intensive(*) and the databases
take more work to train.  I suspect it would be fairly good at classifying
your mail, though.

Skip

(*) If such a scheme allowed you to dispense with procmail entirely, it
might not cost much, however.  A single Python process scoring against
multiple SpamBayes databases would probably be more efficient than procmail
forking off a bunch of hammiefilter processes.  Of course, to replace
procmail you'd have to be as careful about not losing mail as procmail is.
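
To make the single-process idea concrete, here is a self-contained sketch
that keeps several classifiers in memory and scores one message against all
of them.  ToyClassifier is a stand-in with a crude word-count score, not the
real SpamBayes classifier or its math; the 0.2 cutoff, the tiny corpus, and
the names build and classify are all assumptions for illustration.  The
spam.db check is left out to keep it short.

```python
import math
from collections import Counter


class ToyClassifier:
    """Stand-in for one database; NOT the real SpamBayes scoring."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def learn(self, text, is_spam):
        label = "spam" if is_spam else "ham"
        for word in set(text.lower().split()):
            self.counts[label][word] += 1
        self.totals[label] += 1

    def spamprob(self, text):
        log_odds = 0.0
        for word in set(text.lower().split()):
            seen_spam = self.counts["spam"][word]
            seen_ham = self.counts["ham"][word]
            if seen_spam == 0 and seen_ham == 0:
                continue  # words never seen in training carry no evidence
            # Smoothed fraction of training messages containing the word.
            p_spam = (seen_spam + 1) / (self.totals["spam"] + 2)
            p_ham = (seen_ham + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


def build(corpus, topics):
    """Train one classifier per topic: the topic is "ham", the rest "spam"."""
    models = {}
    for topic in topics:
        clf = ToyClassifier()
        for label, texts in corpus.items():
            for text in texts:
                clf.learn(text, is_spam=(label != topic))
        models[topic] = clf
    return models


def classify(models, text, ham_cutoff=0.2):
    """Score against every model in this one process; lowest score wins."""
    scores = {topic: clf.spamprob(text) for topic, clf in models.items()}
    best = min(scores, key=scores.get)
    return best if scores[best] <= ham_cutoff else "mbox"


corpus = {
    "python": ["import module def class", "pip install def lambda"],
    "perl": ["my $x regex cpan", "use strict cpan sigil"],
    "ruby": ["gem rails block yield", "irb gem def end"],
}
models = build(corpus, list(corpus))
print(classify(models, "cpan regex strict"))
print(classify(models, "hello world"))
```

Unlike the procmail cascade, this picks the topic with the lowest score
instead of the first "ham" verdict, which is easy to do once all the scores
are available in one process.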

S


