[Spambayes] Training Disparity Issues
Richard B Barger ABC APR
Rich at RBarger.com
Sat Jul 17 03:52:34 CEST 2004
Hi. I have some training disparity questions.
I'm using POP3 Proxy Version 1.0rc2. I get huge amounts of email --
1400 or 1500 messages a day -- sent to several recipient names, all of
which are picked up by the same POP3 account.
- I make extensive use of Netscape Mail's filters. SpamBayes is set to
add "spam" and "unsure" headers, but not "ham." The filters first skim
off messages I routinely receive, then sort mail into "Spam," "Unsure,"
"CornerBarPR," and a bunch of other folders, with any unfiltered mail
eventually dropping through my filter sieve into my Inbox.
- The content of this mail stream is more than 80 percent spam or, at
least, unwanted mail.
- Because of the volume, I get many, many different types of spam; my
ham mail seems to be much more uniform and more easily categorized.
- I have continued to reduce my ham and spam score cutoffs (currently
Ham = 0.01, Spam = 0.39), but I still get far too many unsures; almost
all of them are spam:
- Out of 31,119 classified messages, after initial training I've only
had about a half dozen that showed up in the Spam folder but that I
would have personally classified as ham, and none of those messages was
- I get about 2-3 percent spam in my ham folders, more in the
CornerBarPR ham (Inbox) folder and less in the RBarger.com ham folder.
- I haven't tracked the unsures carefully, but I believe they're about
95 percent spam.
- I train on the unsures that are within a reasonable distance of my
spam score cutoff of 0.39 (I'd train on a message with a spam
probability of 0.25, for instance, but probably not on one that was
0.08), but wonder if reducing the spam cutoff any further is playing
with fire, increasing my risk of beginning to filter legitimate mail
into the spam folder?
You can tell that this is one issue I don't understand too well.
- Because of the ham-spam disparity, whenever I train on unsures, I have
to force-feed ham to try to keep the training proportion reasonable. I
rarely have any new types of ham to train for "balance," so I end up
training on ham messages that are very similar to ones trained
previously. Right now, I've trained 1299 spam and 764 ham.
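To make the numbers above concrete, here is a minimal sketch of how a score falls into ham, unsure, or spam with the cutoffs quoted in this message, along with the current training ratio. The function name bucket and the handling of scores exactly on a cutoff are assumptions for illustration, not taken from the SpamBayes source.

```python
# Cutoffs and training counts quoted in the message above.
HAM_CUTOFF = 0.01
SPAM_CUTOFF = 0.39


def bucket(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spam probability in [0, 1] to 'ham', 'unsure', or 'spam'.

    Assumption: scores exactly equal to a cutoff are treated as unsure.
    """
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


# Training imbalance: spam messages trained per ham message trained.
spam_trained, ham_trained = 1299, 764
spam_to_ham = spam_trained / ham_trained

if __name__ == "__main__":
    for score in (0.005, 0.08, 0.25, 0.5):
        print(score, bucket(score))
    print("spam:ham trained =", round(spam_to_ham, 2))
```

With these cutoffs, both the 0.25 and the 0.08 examples mentioned above land in the unsure band; lowering the spam cutoff only narrows that band from the top.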
That leads to my questions (thank you for being patient):
1 - First, I don't think the message header information contributes to
the spam score. I don't see it showing up in the message clues. But,
just to be sure, here's my question: Would >identical< mail sent to
rich at rbarger.com, rich at cornerbarpr.com, info at cornerbarpr.com,
support at cornerbarpr.com, rich at swbell.net, rich at sprintmail.com, etc., be
scored the same? Or would the different addressees and other minor
differences in the headers cause scoring differences?
(I have the impression that, out of the same mail stream, mail filtered
to rich at rbarger.com tends to be scored properly, whereas I get more spam
in my rich at cornerbarpr.com folder.)
2 - Similarly, if I train a message manually (which, with POP3, is a
cut-and-paste operation), will the message score differently if all
headers are showing than if headers are displayed normally?
3 - I don't know your algorithm, but I'd guess that, as I get a larger
training corpus, each trained message contributes a smaller amount to
scoring than would be the case with fewer trained messages. True? If
so, does that mean that to "move the needle" for unsure or misclassified
ham or spam, I should train on the same message numerous times?
4 - I've made surprisingly few training mistakes (I think), but I don't
remember reading how to correct a message incorrectly trained, when
using the POP3 Proxy. How do I do this?
5 - I know that everyone's mail stream and personal ham-spam threshold
are different, but are the ham-spam score cutoffs I'm using (0.01 and
0.39) reasonable, or have I missed something obvious -- an instruction
or message in which one of you gurus has posted cutoff settings or
6 - Can you suggest anything else I should be considering in setting my
ham-spam score cutoffs, to deal with my still-too-big pile of unsures,
while still leaving me with only a tiny chance of a Type II error?
7 - I've thought about wiping out my database and starting over, in
hopes that I could reduce the pile of unsures requiring continued manual
training. I certainly get enough mail that it wouldn't be a problem to
come up with enough new examples, and I now know a lot more about the
different "types" of spam I'm receiving and would be somewhat smarter on
Because of the volume of unsures, training already is a time-consuming
process, so re-training wouldn't be that big an issue. But that still
doesn't solve the ham-spam disparity. My goal is to reduce the number
of unsures: What would you suggest?
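Question 3 above can be explored with a small sketch. It assumes the classifier estimates each token's spam probability from per-class ratios and then smooths it with Gary Robinson's adjustment; the function names, the smoothing constants s=0.45 and x=0.5, and the example counts are all assumptions chosen for illustration, not values confirmed in this message.

```python
def token_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Smoothed spam probability for one token (illustrative sketch).

    spam_count / ham_count: trained messages of each class containing the token.
    nspam / nham: total trained messages of each class.
    s, x: assumed strength and prior for tokens with little evidence.
    """
    spam_ratio = spam_count / nspam if nspam else 0.0
    ham_ratio = ham_count / nham if nham else 0.0
    if spam_ratio + ham_ratio == 0:
        return x  # token never seen: fall back to the prior
    raw = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    return (s * x + n * raw) / (s + n)


def shift_from_one_spam(spam_count, ham_count, nspam, nham):
    """Change in the token's probability after training one more spam
    message that contains the token."""
    before = token_spam_prob(spam_count, ham_count, nspam, nham)
    after = token_spam_prob(spam_count + 1, ham_count, nspam + 1, nham)
    return after - before


# Same token balance, small versus large training corpus.
small = shift_from_one_spam(5, 5, 10, 10)
large = shift_from_one_spam(500, 500, 1000, 1000)

if __name__ == "__main__":
    print("shift with 10+10 trained messages:    ", small)
    print("shift with 1000+1000 trained messages:", large)
```

Under these assumptions, one extra trained message moves a token's estimate far less once the counts are large, which is the effect question 3 guesses at.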
This ham-spam disparity has been an occasional topic in this group
lately. If roughly equal piles of ham and spam are important for most
effective classification, it appears to me that it might be useful for
the program simply to include a weighting factor.
Even though you may philosophically object, I'd think a "weight" is no
different from my training on so many ham messages that are very
much like hams I've already trained on. Comments?
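As a sketch of what such a weighting factor could look like: if a token's raw spam estimate is built from per-class ratios (an assumption here, not something confirmed in this message), that is algebraically the same as multiplying the ham counts by nspam/nham. The function names and the token counts of 40 and 10 are hypothetical.

```python
def raw_prob_per_class_ratio(spam_count, ham_count, nspam, nham):
    """Raw estimate from the fraction of each class containing the token."""
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    return spam_ratio / (spam_ratio + ham_ratio)


def raw_prob_weighted_counts(spam_count, ham_count, ham_weight):
    """Raw estimate from plain counts, with ham counts scaled by a weight."""
    return spam_count / (spam_count + ham_weight * ham_count)


# Training totals quoted earlier in this message.
nspam, nham = 1299, 764
ham_weight = nspam / nham  # hypothetical weighting factor

# Hypothetical token seen in 40 trained spam and 10 trained ham.
by_ratio = raw_prob_per_class_ratio(40, 10, nspam, nham)
by_weight = raw_prob_weighted_counts(40, 10, ham_weight)

if __name__ == "__main__":
    print(by_ratio, by_weight)
```

The two functions agree because (sc/ns) / (sc/ns + hc/nh) simplifies to sc / (sc + hc * ns/nh), so a per-class ratio already behaves like a weight on the smaller pile.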
Whining aside, I love SpamBayes. It certainly is moving huge piles of
spam into the proper place -- mail that I otherwise would have
I have another couple of issues and questions, but let me stop with this
long message and post them later.
Thanks for your help.