[Spambayes] SpamBayes now filters less than 50% of my spam.
Skip Montanaro
skip at pobox.com
Fri Nov 14 12:18:43 EST 2003
Rob> I am having the same problem. I have 311 ham and 1093 spam in my
Rob> training database yet SB only catches about 50% of the incoming
Rob> spam.
I would try retraining from scratch. I'm getting very good results with
just over 100 ham and spam at the moment.
I'm beginning to believe it's not even necessary to train on all unsures or
mistakes. My mail gets transferred from my server and scored in bunches
every five minutes (24x7) and I get a lot of mail, so I may come in in the
morning and find a dozen unsures in my mailbox (as well as a few hundred
properly classified spams). I try training on one or two unsure messages,
then recheck the remaining unsures, eliminating any which now score as ham
or spam. (See below for how I do that.)
I've developed a few seat-of-the-pants training maxims, both from personal
experience and from reading what others have done:
* Don't be afraid to retrain from scratch. The system learns quickly.
Retraining from scratch is often the quickest way to recover from
training mistakes.
* Bigger is not always better, no matter what all those enlargement
messages would have you believe. A larger database is harder to
examine for mistakes, and a few mistakes skewed in the same direction
may be hard to overcome with correct training. You'll also reach a
point where you want to just delete all that spam. Once you do that,
you've completely lost the ability to find mistakes. If you only have
a few messages in your training database things will be easier to
manage.
* Never train on the same message twice. Using iterative reasoning it's
easy to see you should never train on the same message 100 or 1000 times
either. ;-)
* Seek balance in your training database. Similar numbers of ham and
spam are good.
* Don't automatically train on all incoming messages. If you get
swamped with spam, you will quickly wind up with a training database
which is wildly out-of-balance.
* Don't worry about training on every unsure message either. Some
messages just aren't amenable to a strict classification. For
example, a bounce message from a mail server containing an attached
spam may be best left untrained. It contains both strong ham clues
(all the postmaster gibberish which you would get in a bounce of an
otherwise valid message) and strong spam clues (the spam message
itself). Calling that message as ham or spam is likely to worsen the
classification of future mail bounces or future similar spam.
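As a rough illustration of the balance maxim, the sketch below computes how lopsided a training database is from its ham and spam counts. The function name and the 2:1 threshold are assumptions made for this example; they are not SpamBayes settings. The counts are the ones Rob reported above.

```python
def training_balance(nham, nspam, max_ratio=2.0):
    """Return (ratio, balanced) for a pair of training counts.

    ratio is the larger count divided by the smaller one.
    max_ratio is an illustrative threshold, not a SpamBayes option.
    """
    if nham <= 0 or nspam <= 0:
        # With nothing trained on one side there is no balance to speak of.
        return float("inf"), False
    ratio = max(nham, nspam) / float(min(nham, nspam))
    return ratio, ratio <= max_ratio


# Rob's database: 311 ham and 1093 spam, roughly 3.5 spam per ham.
ratio, balanced = training_balance(311, 1093)
```

With those counts the database would be flagged as out of balance under the assumed threshold, which fits the suggestion to retrain from scratch with similar numbers of each.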
My environment is much different than yours, so I don't know how you'd get
the Outlook plugin to score messages again, but if it can do that, a little
judicious checking will probably avoid the need to over-train. For example,
if there are several unsure messages related to online prescriptions,
training on just one of them as spam may be sufficient to cause the rest to
now score as spam.
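The train-on-one-then-recheck routine can be sketched as below. This is only an illustration: `score` is a hypothetical callable that returns a spam probability between 0 and 1, the toy classifier is a stand-in and not SpamBayes code, and the 0.20 and 0.90 cutoffs are assumed values that may differ from your configuration.

```python
def still_unsure(messages, score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Return the messages whose score still falls between the cutoffs.

    score is a hypothetical callable mapping a message to a spam
    probability in [0, 1]. Treating both boundaries as strict is a
    simplification made for this sketch.
    """
    return [m for m in messages if ham_cutoff < score(m) < spam_cutoff]


# Toy stand-in for a classifier: any message containing a word that was
# "trained" as spam scores high, everything else stays in the middle.
spam_words = set()

def toy_score(msg):
    return 0.95 if spam_words & set(msg.split()) else 0.5


unsures = [
    "cheap pills online",
    "pills without prescription",
    "lunch on friday?",
]

# Train on just the first prescription message, then recheck the rest.
spam_words.add("pills")
remaining = still_unsure(unsures[1:], toy_score)
```

After the single training step only the unrelated message is left as unsure, so the second prescription message never needs to be trained on.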
For those on Unix-y systems (I use Mac OS X) with access to the CVS
repository, here's what I run to check my unsures:
    sb_filter.py ~/Mail/unsure | python ~/tmp/scan-unsures.py
Where scan-unsures.py is
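    #!/usr/bin/env python
    # Print the Subject, Message-ID and classification headers of every
    # message on stdin that is still classified as unsure.
    import sys, re
    sub = msgid = cls = ""
    for line in sys.stdin:
        if line.startswith("From "):
            # mbox separator line: a new message starts, so reset state
            sub = msgid = cls = ""
        elif line.lower().startswith("subject: "):
            sub = line.strip()
        elif line.lower().startswith("message-id: "):
            msgid = line.strip()
        elif line.lower().startswith("x-spambayes-classification: "):
            cls = line.strip()
            if re.search("unsure", cls) is not None:
                # print(x) with one argument behaves the same on
                # Python 2 and Python 3
                print(sub)
                print(msgid)
                print(cls)
                sub = msgid = cls = ""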
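An alternative sketch, assuming the scored output has first been saved to an mbox file rather than piped, uses the standard library's mailbox module to parse the headers instead of matching lines by hand. The function name and the idea of saving to a file are assumptions for this example; the header name is the one the script above looks for.

```python
import mailbox


def unsure_summaries(mbox_path):
    """Yield (subject, message-id, classification) for unsure messages.

    mbox_path is assumed to name an mbox file whose messages already
    carry an X-Spambayes-Classification header.
    """
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            cls = str(msg.get("X-Spambayes-Classification", ""))
            if "unsure" in cls.lower():
                yield (
                    str(msg.get("Subject", "")),
                    str(msg.get("Message-ID", "")),
                    cls,
                )
    finally:
        box.close()
```

Because the parser only looks at real headers, a line in a message body that happens to start with "Subject: " is not picked up, which the line-matching script cannot guarantee.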
You need the latest version of sb_filter.py which I checked in a couple days
ago.
Skip