[Spambayes] Training Disparity Issues
Richard B Barger ABC APR
Rich at RBarger.com
Mon Jul 19 06:58:49 CEST 2004
Tony Meyer wrote:
> > - I make extensive use of Netscape Mail's filters. SpamBayes
> > is set to add "spam" and "unsure" headers, but not "ham."
> Is this through some sort of modification? By default, SpamBayes will add a
> "X-SpamBayes-Classification" header for all messages: ham, spam and unsure. Or
> do you mean that you're also adding a notation to the to/subject header, but
> only for spam and unsures?
RBB: The latter, Tony. In the "Header Options" section of the Web Interface
Configuration page, I have checked "spam" and "unsure" in the section, "Classify
in subject header," but not "ham."
I don't know how to sort or filter in Netscape unless I have SpamBayes add the
"spam" and "unsure" notations. Then sorting is a breeze. The
"X-SpamBayes-Classification" header is behind the scenes and doesn't show up in
my messages unless I click the "All Headers" button.
> > - I have continued to reduce my ham and spam score cutoffs (currently
> > Ham = 0.01, Spam = 0.39),
> The spam threshold is *very* low. If a token hasn't been seen before, it gets
> a score of 0.5. So if you get a message comprised completely of tokens you
> haven't seen before, the message will score 0.5 (it's a tad more complicated
> than this, but it's a workable lie-to-children). With these thresholds, that
> means it'll be spam. Having the spam threshold over 0.6 would be a good idea.
RBB: That's what I wanted to know, Tony, and it's what I was afraid of, so
THANK YOU! Your "lie-to-children" makes perfect sense to me.
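If I've got it, the cutoff logic amounts to something like this little sketch (just an illustration of how the two cutoffs map a score to a label; the real score comes from SpamBayes' chi-squared combining, and the names here are mine, not SpamBayes'):

```python
# Toy illustration: how ham/spam cutoffs turn a combined score into a
# label. Not SpamBayes' actual combining code.

def classify(score, ham_cutoff, spam_cutoff):
    """Map a spam probability to a label using the two cutoffs."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"

# A message made entirely of never-seen tokens scores about 0.5:
print(classify(0.5, ham_cutoff=0.01, spam_cutoff=0.39))  # -> spam
print(classify(0.5, ham_cutoff=0.20, spam_cutoff=0.90))  # -> unsure
```

Which is exactly why my 0.39 cutoff shoves brand-new kinds of mail straight into the spam pile.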
My experience right now (no data, just gut feeling) is this:
-I still get a little spam in my ham folder, but it's a small percentage.
-I had been getting too many unsures, but since I gradually dropped my Spam
cutoff to the too-low 0.39, their number has dropped quite a bit.
-90+ percent of my unsures still are spam.
-Right now, I continue to visually scan all spam message Subject lines. After
what I considered my initial training period, I've received fewer than a dozen
ham messages in my spam folder -- that's maybe 12 out of 32,600 messages. And
each of the misclassified messages has been unimportant; while I kind of
wanted to see them, it wouldn't have made any difference had I >not< seen them.
I should have checked the scores and tokens of these, but most occurred before I
knew how to do so. If it happens again, I'll know to check.
Anyway, what I think applies to me is: My ham mail stream must be pretty
uniform and accurately defined by SpamBayes. However, if I get a new type of
ham message, you are telling me that there is a strong likelihood it will be
misclassified as spam.
You've been clear that it would be prudent to raise my Spam cutoff. Even so,
I'll have to think about this, because as long as my current setting isn't
misclassifying, it saves me from having to manually deal with an annoyingly high
level of Unsures, most of which are proving to be Spam.
Question: What about my Ham Score cutoff of 0.01? Many of my Hams come out
with an "X-Spambayes-Spam-Probability:" of 0.00000, but, of course, not all
do. Because SpamBayes has been doing a good job of classifying ham, I only
check the spam-probability scores occasionally. What do you suggest?
> > but I still get far too many unsures;
> Roughly what percentage of your incoming mail would be unsure? Common numbers
> AFAIK are between 2 and 5%, which would be 30-75 messages per day with 1500
> incoming messages.
RBB: Hmmm. I thought I had read on the list that 2-3 percent was about right.
This weekend was a poor test, and training yesterday and today based on Adam
Walker's extremely helpful responses (Thank you again, Adam) has reduced the
volume of unsures and will change my results. I'll have a better idea this
coming week, but I believe my Unsures had been about twice that high (maybe a
little less than 10 percent) before revising my training regimen.
> > 4 - I've made surprisingly few training mistakes (I think),
> > but I don't remember reading how to correct a message incorrectly
> > trained, when using the POP3 Proxy. How do I do this?
> If the message is still in the sb_server caches (by default they expire out of
> there in 7 days), you can use the "find message" query on the front page. This
> will bring up the message in a standard review page. Any
> untraining/retraining required based on your selection will be done.
RBB: Ah! Makes sense. This info probably is in the Help or the FAQs
somewhere, but I missed it.
So, if I had trained a message as Spam, and came back and trained the same
message as Ham, SpamBayes would no longer consider the tokens as added to the
"spam" pile and would, instead, add them to the "ham" pile (using the
workable-lie-to-children methodology mentioned earlier)?
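If I follow the idea, the bookkeeping might look roughly like this toy sketch (the class and method names are just illustrations, not the real SpamBayes API):

```python
# Toy sketch of untrain/retrain bookkeeping: per-token ham/spam counts,
# where "untraining" decrements what the original training incremented.
from collections import defaultdict

class TinyClassifier:
    def __init__(self):
        self.spamcount = defaultdict(int)
        self.hamcount = defaultdict(int)

    def train(self, tokens, is_spam):
        counts = self.spamcount if is_spam else self.hamcount
        for tok in tokens:
            counts[tok] += 1

    def untrain(self, tokens, is_spam):
        counts = self.spamcount if is_spam else self.hamcount
        for tok in tokens:
            counts[tok] = max(0, counts[tok] - 1)

c = TinyClassifier()
msg = ["free", "offer"]
c.train(msg, is_spam=True)    # the mistaken training
c.untrain(msg, is_spam=True)  # correction: remove from the spam pile...
c.train(msg, is_spam=False)   # ...and add to the ham pile
print(c.spamcount["free"], c.hamcount["free"])  # -> 0 1
```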
> If the message isn't still in the sb_server caches, then there isn't any
> facility for doing this with sb_server. One of the command-line tools (if
> you're running from source) could do this, I presume. You can just train the
> message (via the train facility on the front page) correctly, which will
> 'cancel out' the incorrect training (assuming that no tokenizing options have
> changed in the meantime), in some ways. This is far from ideal, though.
RBB: So my intuition was sort of correct, when I asked (in a different context)
if training the same message (using the "Train on a message ..." box on the Web
Interface page) multiple times as, say, ham helps solidify its tokens as ham --
and might even sort of "overrule" incorrect training of that message as spam.
(Badly worded, but I hope you get the idea.)
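A toy ratio makes the point: correct re-training without untraining only dilutes a mistake, it never erases it (SpamBayes' actual per-token probability is more involved than this simple count ratio, so take this purely as illustration):

```python
# Toy per-token spam probability as a bare count ratio -- not the real
# SpamBayes formula, just enough to show dilution vs. cancellation.

def token_spamprob(spamcount, hamcount):
    return spamcount / (spamcount + hamcount)

# Token mistakenly trained once as spam:
print(token_spamprob(1, 0))  # -> 1.0
# Train the same message as ham once, without untraining:
print(token_spamprob(1, 1))  # -> 0.5
# Train it as ham twice more -- the spam evidence is diluted, not gone:
print(token_spamprob(1, 3))  # -> 0.25
```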
> > This ham-spam disparity has been an occasional topic in this group
> > lately. If roughly equal piles of ham and spam are important for most
> > effective classification, it appears to me that it might be useful for
> > the program simply to include a weighting factor.
> There once was one (the experimental_ham_spam_imbalance option). It proved to
> hurt more than help, and so was deprecated then removed. If someone can come
> up with one that works, then it would certainly get put through the tests, and
> added as an experimental option if it does seem to work over many corpora.
RBB: You guys are thorough; I should have known that you'd already looked into it.
From my perspective, my spam pile is huge and of many, many different types,
while my ham pile is relatively small and uniform, so I end up training hams
that are all but the same as other hams, just to keep the training roughly
balanced. You can see why it appeared to me that a multiplier or weighting
factor of some type might work just as well. Thanks for the explanation.
> There isn't really enough known about the effects of different training
> regimes at the moment. There's a fair chunk of stuff on the wiki
> <http://entrian.com/sbwiki> which you probably should read (or skim), for
RBB: I'll do that. Thanks.
> There's at least one training technique (train-to-exhaustion, or tte) that
> *forces* a balanced database. Testing on the different training regimes has
> been pretty limited so far, but it looks like tte is as good as, if not
> better than, any of the others. At least one SpamBayes developer uses tte
> for SpamBayes training. The difficulty is that although there's a tte.py
> script in the source dist, there isn't really any simple way to do tte with
> sb_server at the moment (this will probably arrive, but not soon). You
> could probably rig up some sort of system, but it would be complicated.
RBB: I'm not up to learning how to use source; sorry. I'll wait for the
paperback edition. But it sounds promising.
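For what it's worth, here is the tte idea as I understand it, sketched as a toy loop: keep re-training on whatever the current database misclassifies until a full pass produces no mistakes. The classifier and "database" below are stand-ins of my own invention, not SpamBayes code:

```python
# Toy train-to-exhaustion loop. A real tte run works against an actual
# SpamBayes database; this stub just demonstrates the control flow.

def train_to_exhaustion(messages, classify, train, max_rounds=10):
    """messages: list of (tokens, is_spam) pairs."""
    for _ in range(max_rounds):
        mistakes = 0
        for tokens, is_spam in messages:
            if classify(tokens) != is_spam:
                train(tokens, is_spam)
                mistakes += 1
        if mistakes == 0:
            break  # one clean pass over the corpus: stop training

# Stand-in classifier: each token simply remembers its last label.
db = {}  # token -> True (spam) / False (ham)

def classify(tokens):
    votes = [db[t] for t in tokens if t in db]
    return votes.count(True) > votes.count(False)

def train(tokens, is_spam):
    for t in tokens:
        db[t] = is_spam

messages = [(["free", "offer"], True), (["meeting", "agenda"], False)]
train_to_exhaustion(messages, classify, train)
print(classify(["free", "offer"]))  # -> True
```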
Tony, I don't know how to thank you sufficiently without simultaneously drooling
all over myself, but I certainly appreciate the extremely helpful, thoughtful
responses.
> =Tony Meyer
> Please always include the list (spambayes at python.org) in your replies
> (reply-all), and please don't send me personal mail about SpamBayes. This
> way, you get everyone's help, and avoid a lack of replies when I'm busy.