[Spambayes] Training Disparity Issues
Richard B Barger ABC APR
Rich at RBarger.com
Mon Jul 19 21:33:44 CEST 2004
See below, pls. I think I found a glitch, and there are a couple of substantive
questions, if you look far enough.
Tony Meyer wrote:
<lots of snippage here>
> As an aside to anyone reading this: I'm not sure how widely recognised the
> "lie to children" phrase is outside of physicists & Discworld readers. Here's
> a definition: <http://en.wikipedia.org/wiki/Lie-to-children>
> > Anyway, what I think applies to me is: My ham mail stream
> > must be pretty uniform and accurately defined by SpamBayes.
> > However, if I get a new type of ham message, you are telling
> > me that there is a strong likelihood it will be misclassified.
> Yes, although it depends *how* new. The headers, for example, are almost
> certain to have been seen before (although they may push the message in the
> wrong direction, of course).
RBB: I understand.
> > You've been clear that it would be prudent to raise my Spam
> > cutoff. Even so, I'll have to think about this, because as long
> > as my current setting isn't misclassifying, it saves me from
> > having to manually deal with an annoyingly high
> > level of Unsures, most of which are proving to be Spam.
> Yes, the golden rule is always 'if it isn't broken, don't fix it'.
RBB: But I will try to stay awake, and "keep my eye on it." And if I screw up,
I will know the source of the problem.
> > Question: What about my Ham Score cutoff of 0.01? Many of
> > my Hams come out with an "X-Spambayes-Spam-Probability:" of 0.00000,
> > but, of course, not all do. Because SpamBayes has been doing a good
> > job of classifying ham, I only check the spam-probability scores
> > occasionally. What do you suggest?
> I don't know. This is one area where the Outlook plug-in (which I use for the
> most part) has a big advantage (as a result of tight integration with the mail
> client) - I see the spam score for all messages in a little column that I have
> to the left of the display. So I know for sure that I get the odd ham that's
> up near 10%. If you don't, then 0.01 might work for you (and if it is, then
> stick with it).
> Perhaps you could use the Review page to scan through the unsures, and see
> what the scores are like? (To do this, you have to enable two advanced
> options via the web interface - one to add the "Score" header, and one to show
> the "Score" column in the review page).
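For readers following the cutoff discussion, here is a minimal sketch of how two cutoffs split scores into ham, unsure and spam. The function name, the 0.90 spam cutoff, and the exact boundary comparisons (`<` and `>=`) are assumptions for illustration, not taken from the SpamBayes source.

```python
def classify(score, ham_cutoff=0.01, spam_cutoff=0.90):
    """Map a spam probability in [0, 1] to a label using two cutoffs.

    ham_cutoff=0.01 is the value mentioned in this thread; spam_cutoff=0.90
    is only an illustrative value, and the boundary handling is assumed.
    """
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


if __name__ == "__main__":
    # A ham that scores "up near 10%" is unsure with a 0.01 ham cutoff,
    # but counts as ham with a looser (hypothetical) cutoff of 0.20.
    for cutoff in (0.01, 0.20):
        print(cutoff, classify(0.10, ham_cutoff=cutoff))
```

With a very low ham cutoff, any ham that does not score almost exactly zero lands in the unsure pile, which is why scanning the real scores first is useful.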
RBB: Yes! Yes! Yes! Absolutely! Thank you!
However, I believe I found a bug:
I had already clicked those settings several days ago, Tony, and nothing had
changed on my Review Messages screen. That is, the "Score" header had not been
added. I tried this in both Firefox and IE, with the same results.
After fiddling around a bit with the Advanced Configuration Interface Options, I
found that I needed to alter the very first option -- "Headers to display in
message review" -- before the Score header would appear in the Review Messages
screen.
After quite a bit of trial and error, I found that simply putting in a trailing
space after the default headers, "Subject From ", tricked the system into
displaying the Score header on the Review Messages page. Really.
Now maybe I've been drinking, but, if so, I don't remember it. <g> I couldn't
get the Score header to display until I made a change in the "Headers to display
in message review" window.
> [Percentage of unsures]
> > Hmmm. I thought I had read on the list that 2-3 percent was about right.
> Well, 2-3 percent is probably the more normal end, but some testing has
> resulted in numbers closer to 5%.
> > I'll have a better idea this coming week, but I believe my Unsures
> > had been about twice that high (maybe a little less than 10 percent)
> > before revising my training regimen.
> You should definitely manage to get down to 5% or less, from what we've seen.
RBB: Tony, I believe the other changes I've been making in response to list
suggestions already have reduced my volume of unsures quite a bit.
> > So my intuition was sort of correct, when I asked (in a
> > different context) if training the same message (using the
> > "Train on a message ..." box on the Web
> > Interface page) multiple times as, say, ham helps solidify
> > its tokens as ham -- and might even sort of "overrule" incorrect
> > training of that message as spam.
> > (Badly worded, but I hope you get the idea.)
> Yes. The main complication (as I understand things, and I'm not the stats
> expert) is that the total number of ham/spam messages trained also has an
> effect. So while you're strengthening the tokens in that message, in a way
> you're also weakening the tokens that aren't in that message.
RBB: Excellent point, well put.
As I asked in an earlier message -- and I understand this isn't quite the same
thing -- "I don't know your algorithm, but I'd guess that, as I get a larger
training corpus, each trained message contributes a smaller amount to scoring
than would be the case with fewer trained messages. True?"
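A rough sketch of why the message totals matter, assuming a Robinson-style per-token estimate of the kind SpamBayes is generally described as using. The function name, the parameter names and the values s=0.45 and x=0.5 are assumptions for illustration; the real classifier may differ in detail.

```python
def token_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Estimate how spammy a single token is (illustrative sketch only).

    spam_count / ham_count: trained spam / ham messages containing the token.
    nspam / nham: total trained spam / ham messages.
    s, x: assumed strength and prior used for rarely seen tokens.
    """
    ham_ratio = ham_count / nham if nham else 0.0
    spam_ratio = spam_count / nspam if nspam else 0.0
    total = ham_ratio + spam_ratio
    if total == 0:
        return x  # never-seen token: fall back to the prior
    raw = spam_ratio / total
    n = spam_count + ham_count
    # Blend the raw estimate with the prior; more sightings means less prior.
    return (s * x + n * raw) / (s + n)


if __name__ == "__main__":
    # Token seen in 10 of 100 spam and 2 of 100 ham.
    before = token_spam_prob(10, 2, 100, 100)
    # Train 50 more ham messages that do NOT contain the token.
    after = token_spam_prob(10, 2, 100, 150)
    print(before, after)
```

In this sketch the second value comes out higher than the first: training extra ham that lacks a token makes that token look relatively spammier, which is the "weakening" effect described above. Likewise, one extra ham message moves the score less when the corpus is ten times larger, because the ratios are divided by the message totals.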
> Tony: We get a lot of "it isn't classifying properly" emails to which the
> response is "please try balancing your database", and it would be great if we
> could take care of that
> BTW, one other thing to try, although to get the proper effect you do need to
> retrain (although you can always save the current databases and move back to
> them later) is the unigram/bigrams experimental option. The idea is basically
> that the classifier looks at both individual 'words' and pairs of words,
> although it's more complicated than that. There hasn't been enough testing
> done to say for sure that it's always better to use it, but it certainly looks
> good in most situations, and many of the developers (including me and the guy
> who *is* the stats expert) are using it in our day-to-day systems.
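To make the "individual words and pairs of words" idea concrete, here is a minimal sketch of turning a token stream into unigrams plus bigrams. The function name and the "bi:" prefix are illustrative assumptions, and the sketch ignores the extra complications Tony alludes to (for example, how overlapping tokens are handled when scoring).

```python
def unigrams_and_bigrams(tokens):
    """Yield each token, plus a combined token for each adjacent pair."""
    previous = None
    for token in tokens:
        yield token
        if previous is not None:
            # The "bi:" prefix is only an illustrative way to mark pairs.
            yield "bi:%s %s" % (previous, token)
        previous = token


if __name__ == "__main__":
    print(list(unigrams_and_bigrams(["buy", "cheap", "pills"])))
```

Because every adjacent pair becomes an extra token, the number of distinct tokens stored grows considerably, which fits the remark that the database ends up bigger.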
RBB: As I've said previously, I don't mind retraining. I think I'd be a bit
smarter about it now. The deal for me is to correctly identify the database
files. Would it be hammie.db and spambayes.messageinfo.db?
I'll be happy to "test" it and give feedback, up to what obviously are the low
limits of my capabilities.
> If you want to try it out, you'll have to open up your configuration file (the
> top of the configuration web page will say where it is) in something like
> notepad and add the lines:
> [Classifier]
> x-use_bigrams: True
> (without the leading spaces) to the end of the file.
> The database will end up bigger, but it should be more accurate and should
> also learn more quickly. This might also help nail the 'spam with a story'
> type messages that you described in another post (which I haven't managed to
> read properly yet).
> (There are a whole slew of other options that could be fiddled about with as
> well, but that's the one that offers the most widespread hope).
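One way to double-check the configuration edit described above is to parse the file and read the option back. This is only a sketch: it uses Python's standard-library configparser (not necessarily what SpamBayes itself uses), it assumes the option sits in a [Classifier] section, and the function name and sample text are made up for illustration.

```python
import configparser

# Hypothetical file contents; the [Classifier] section name is an assumption.
SAMPLE = """\
[Classifier]
x-use_bigrams: True
"""


def bigrams_enabled(config_text):
    """Return True if the text enables x-use_bigrams under [Classifier]."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # fallback=False covers a missing section or a missing option.
    return parser.getboolean("Classifier", "x-use_bigrams", fallback=False)


if __name__ == "__main__":
    print(bigrams_enabled(SAMPLE))
```

To check a real file, read its text from the path shown at the top of the configuration web page and pass it to the function.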
RBB: The text on the Experimental Options page says that it might be a good
idea to increase the max-discriminators option from 150 to 300 or 600. I think
that's the "Maximum number of extreme words" setting on the Advanced
Configuration page. Your thoughts?
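As a rough illustration of what a "maximum number of extreme words" limit does, the sketch below keeps only the tokens whose probabilities lie farthest from 0.5, up to a cap. The function name and the min_strength threshold of 0.1 are assumptions for illustration, not confirmed SpamBayes behaviour.

```python
def most_extreme(token_probs, max_discriminators=150, min_strength=0.1):
    """Pick the tokens whose spam probability is farthest from 0.5.

    token_probs: dict mapping token -> spam probability in [0, 1].
    min_strength: assumed minimum distance from 0.5 for a token to count.
    """
    candidates = [
        (abs(prob - 0.5), token, prob)
        for token, prob in token_probs.items()
        if abs(prob - 0.5) >= min_strength
    ]
    candidates.sort(reverse=True)  # most extreme first
    return [(token, prob) for _, token, prob in candidates[:max_discriminators]]


if __name__ == "__main__":
    probs = {"viagra": 0.99, "the": 0.5, "spambayes": 0.02, "offer": 0.7}
    print(most_extreme(probs, max_discriminators=2))
```

With bigrams enabled each message produces more candidate tokens, which is presumably why the Experimental Options text suggests raising the cap.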
> =Tony Meyer