[Spambayes] SpamBayes now filters less than 50% of my spam.

Fri Nov 14 12:18:43 EST 2003

    Rob> I am having the same problem. I have 311 ham and 1093 spam in my
    Rob> training database yet SB only catches about 50% of the incoming
    Rob> spam.

I would try retraining from scratch.  I'm getting very good results with
just over 100 ham and spam at the moment.

I'm beginning to believe it's not even necessary to train on all unsures or
mistakes.  My mail gets transferred from my server and scored in bunches
every five minutes (24x7) and I get a lot of mail, so I may come in in the
morning and find a dozen unsures in my mailbox (as well as a few hundred
properly classified spams).  I try training on one or two unsure messages,
then recheck the remaining unsures, eliminating any which now score as ham
or spam.  (See below for how I do that.)

I've developed a few seat-of-the-pants training maxims, both from personal
experience and from reading what others have done:

    * Don't be afraid to retrain from scratch.  The system learns quickly.
      Retraining from scratch is often the quickest way to recover from
      training mistakes.

    * Bigger is not always better, no matter what all those enlargement
      messages would have you believe.  A larger database is harder to
      examine for mistakes, and a few mistakes skewed in the same direction
      may be hard to overcome with correct training.  You'll also reach a
      point where you want to just delete all that spam.  Once you do that,
      you've completely lost the ability to find mistakes.  If you only have
      a few messages in your training database things will be easier to
      manage.

    * Never train on the same message twice.  Using iterative reasoning it's
      easy to see you should never train on the same message 100 or 1000
      times either. ;-)

    * Seek balance in your training database.  Similar numbers of ham and
      spam are good.

    * Don't automatically train on all incoming messages.  If you get
      swamped with spam, you will quickly wind up with a training database
      which is wildly out-of-balance.

    * Don't worry about training on every unsure message either.  Some
      messages just aren't amenable to a strict classification.  For
      example, a bounce message from a mail server containing an attached
      spam may be best left untrained.  It contains both strong ham clues
      (all the postmaster gibberish which you would get in a bounce of an
      otherwise valid message) and strong spam clues (the spam message
      itself).  Classifying that message as either ham or spam is likely to
      worsen the classification of future mail bounces or future similar spam.
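
To put a number on the "seek balance" maxim, here is a minimal sketch of a
balance check.  The helper names and the 2.0 threshold are my own
illustrative assumptions, not anything SpamBayes provides; the counts in the
usage note are the ones Rob quoted.

```python
def balance_ratio(nham, nspam):
    """Return the larger count divided by the smaller (1.0 = balanced)."""
    smaller = min(nham, nspam)
    if smaller <= 0:
        # One class has no training messages at all.
        return float("inf")
    return max(nham, nspam) / float(smaller)


def is_unbalanced(nham, nspam, max_ratio=2.0):
    """True if one class outnumbers the other by more than max_ratio.

    max_ratio is a hypothetical rule of thumb, not a SpamBayes setting.
    """
    return balance_ratio(nham, nspam) > max_ratio
```

With 311 ham and 1093 spam, balance_ratio(311, 1093) is roughly 3.5, so
is_unbalanced(311, 1093) is true under the assumed threshold.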

My environment is quite different from yours, so I don't know how you'd get
the Outlook plugin to score messages again, but if it can do that, a little
judicious checking will probably avoid the need to over-train.  For example,
if there are several unsure messages related to online prescriptions,
training on just one of them as spam may be sufficient to cause the rest to
now score as spam.
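
The train-one-then-recheck routine can be sketched roughly as follows.  This
is only an outline under stated assumptions: score, train and label are
hypothetical callables you would wire up to your own setup (they are not
SpamBayes functions), and the 0.20/0.90 cutoffs are assumed values for the
unsure range, so adjust them to match your configuration.

```python
def triage_unsures(unsures, score, train, label,
                   ham_cutoff=0.20, spam_cutoff=0.90):
    """Train on one unsure message at a time, rechecking the rest after each.

    score(msg)          -> spam probability between 0.0 and 1.0 (hypothetical)
    train(msg, is_spam) -> updates the training database (hypothetical)
    label(msg)          -> True for spam, False for ham, None to leave the
                           message untrained (your manual decision)
    Returns the list of messages that were actually trained on.
    """
    remaining = list(unsures)
    trained = []
    while remaining:
        msg = remaining.pop(0)
        is_spam = label(msg)
        if is_spam is None:
            # Some messages (e.g. bounces) are best left untrained.
            continue
        train(msg, is_spam)
        trained.append(msg)
        # Rescore what is left and keep only messages still scored unsure.
        remaining = [m for m in remaining
                     if ham_cutoff < score(m) < spam_cutoff]
    return trained
```

The point of the loop is that each training step may resolve several other
unsures, so the number of messages trained on stays small.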

For those on Unix-y systems (I use Mac OS X) with access to the CVS
repository, here's what I run to check my unsures:

    sb_filter.py ~/Mail/unsure | python ~/tmp/scan-unsures.py

where scan-unsures.py is:

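    #!/usr/bin/env python
    # Read sb_filter.py output (mbox format) on stdin and print the subject,
    # message id and classification header of each message scored as unsure.

    import sys

    sub = msgid = cls = ""

    for line in sys.stdin:
        lower = line.lower()
        if line.startswith("From "):
            # mbox separator line: a new message starts here
            sub = msgid = cls = ""
        elif lower.startswith("subject: "):
            sub = line.strip()
        elif lower.startswith("message-id: "):
            msgid = line.strip()
        elif lower.startswith("x-spambayes-classification: "):
            cls = line.strip()
            if "unsure" in cls:
                # print() with a single argument works in Python 2 and 3
                print(sub)
                print(msgid)
                print(cls)
            sub = msgid = cls = ""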

You need the latest version of sb_filter.py which I checked in a couple days
ago.

Skip