[Spambayes] Training Disparity Issues

Tony Meyer tameyer at ihug.co.nz
Mon Jul 19 11:13:57 CEST 2004


[RBB]
> The latter, Tony.  In the "Header Options" section of 
> the Web Interface Configuration page, I have checked "spam"
> and "unsure" in the section, "Classify in subject header,"
> but not "ham." I don't know how to sort or filter in Netscape
> unless I have SpamBayes add the "spam" and "unsure" notations.

Checking, it turns out that you can't AFAICT (you can in later versions of
Netscape/Mozilla).  No matter - I was just curious and it's not relevant to
the rest of this, anyway.

[RBB]
> Your "lie-to-children" makes perfect sense to me.

As an aside to anyone reading this: I'm not sure how widely recognised the
"lie to children" phrase is outside of physicists & Discworld readers.
Here's a definition: <http://en.wikipedia.org/wiki/Lie-to-children>

[RBB]
> Anyway, what I think applies to me is:  My ham mail stream 
> must be pretty uniform and accurately defined by SpamBayes.
> However, if I get a new type of ham message, you are telling
> me that there is a strong likelihood it will be misclassified.

Yes, although it depends *how* new.  The headers, for example, are almost
certain to have been seen before (although they may push the message in the
wrong direction, of course).

[RBB]
> You've been clear that it would be prudent to raise my Spam 
> cutoff.  Even so, I'll have to think about this, because as long
> as my current setting isn't misclassifying, it saves me from
> having to manually deal with an annoyingly high
> level of Unsures, most of which are proving to be Spam.

Yes, the golden rule is always 'if it isn't broken, don't fix it'.

[RBB]
> Question:  What about my Ham Score cutoff of 0.01?  Many of 
> my Hams come out with an "X-Spambayes-Spam-Probability:" of 0.00000,
> but, of course, not all do.  Because SpamBayes has been doing a good
> job of classifying ham, I only check the spam-probability scores
> occasionally.  What do you suggest?

I don't know.  This is one area where the Outlook plug-in (which I use for
the most part) has a big advantage (as a result of tight integration with
the mail client) - I see the spam score for all messages in a little column
that I have to the left of the display.  So I know for sure that I get the
odd ham that's up near 10%.  If you don't, then 0.01 might work for you (and
if it does, then stick with it).
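
To make the effect of the two cutoffs concrete, here is a minimal sketch
(not SpamBayes's own code).  The function name, the 0.90 spam cutoff and the
handling of scores that land exactly on a cutoff are assumptions made for
illustration; only the 0.01 ham cutoff comes from this thread.

```python
def classify(score, ham_cutoff=0.01, spam_cutoff=0.90):
    """Return 'ham', 'unsure' or 'spam' for a score between 0.0 and 1.0.

    Hypothetical helper: the cutoff names mirror the options discussed in
    this thread, but the boundary handling is an assumption.
    """
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    # Everything between the two cutoffs is left for manual review.
    return "unsure"
```

With a ham cutoff of 0.01, a ham that scores around 0.10 (like the odd one I
see) would be reported as unsure rather than ham; with a looser cutoff such
as 0.20 the same message would count as ham.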

Perhaps you could use the Review page to scan through the unsures, and see
what the scores are like?  (To do this, you have to enable two advanced
options via the web interface - one to add the "Score" header, and one to
show the "Score" column in the review page).

[Percentage of unsures]
> Hmmm.  I thought I had read on the list that 2-3 percent was about right.

Well, 2-3 percent is probably the more normal end, but some testing has
resulted in numbers closer to 5%.

[...]
> I'll have a better idea this coming week, but I believe my Unsures
> had been about twice that high (maybe a little less than 10 percent)
> before revising my training regimen.

You should definitely manage to get down to 5% or less, from what we've seen
elsewhere.

[Tony Meyer]
> If the message is still in the sb_server caches (by default 
> they expire out of there in 7 days), you can use the "find message"
> query on the front page. This will bring up the message in a standard
> review page.  Any untraining/retraining required based on your selection
> will be done automatically.

[RBB]
> Ah!  Makes sense.  This info probably is in the Help or the FAQs
> somewhere, but I missed it.

It may not be.  The documentation needs a lot of work in places,
particularly with newer features.  (We're working on it, but slowly).

[RBB]
> So, if I had trained a message as Spam, and came back and 
> trained the same message as Ham, SpamBayes would no longer
> consider the tokens as added to the "spam" pile and would,
> instead, add them to the "ham" pile (using the
> workable-lie-to-children methodology mentioned earlier)?

Yes, exactly.  The message is first 'untrained', then trained in the new
category.  'Untraining' isn't made manually accessible, because it's easy to
screw up the database if you untrain a message that was never trained (you
end up with negative counts and ugly things like that).
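
As a rough illustration of the untrain-then-train step, and of why untraining
a message that was never trained is dangerous, here is a toy sketch.  It
follows the same simplified "piles of tokens" picture as the earlier
lie-to-children; the class and method names are hypothetical and this is not
how the real database is implemented.

```python
from collections import Counter


class ToyDatabase:
    """Toy model: per-token message counts plus totals of trained messages."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.ham_counts = Counter()
        self.spam_counts = Counter()

    def _update(self, tokens, is_spam, delta):
        counts = self.spam_counts if is_spam else self.ham_counts
        # Each distinct token is counted once per message.
        for token in set(tokens):
            counts[token] += delta
        if is_spam:
            self.nspam += delta
        else:
            self.nham += delta

    def train(self, tokens, is_spam):
        self._update(tokens, is_spam, +1)

    def untrain(self, tokens, is_spam):
        # No check that the message was ever trained: that is the danger.
        self._update(tokens, is_spam, -1)

    def retrain(self, tokens, was_spam):
        """Move a message from one pile to the other."""
        self.untrain(tokens, was_spam)
        self.train(tokens, not was_spam)


message = ["cheap", "pills", "cheap"]

db = ToyDatabase()
db.train(message, is_spam=True)      # trained as spam by mistake
db.retrain(message, was_spam=True)   # untrained from spam, trained as ham

broken = ToyDatabase()
broken.untrain(message, is_spam=True)  # never trained: counts go negative
```

After the retrain, `db` holds the message only in the ham pile.  In `broken`,
both the spam message total and the per-token spam counts end up at -1, which
is the kind of ugliness that keeps untraining out of the interface.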

[RBB]
> So my intuition was sort of correct, when I asked (in a 
> different context) if training the same message (using the
> "Train on a message ..." box on the Web
> Interface page) multiple times as, say, ham helps solidify 
> its tokens as ham -- and might even sort of "overrule" incorrect
> training of that message as spam.
> (Badly worded, but I hope you get the idea.)

Yes.  The main complication (as I understand things, and I'm not the stats
expert) is that the total number of ham/spam messages trained also has an
effect.  So while you're strengthening the tokens in that message, in a way
you're also weakening the tokens that aren't in that message.
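
A simplified sketch of why the totals matter: if a token's spamminess is
estimated from the fraction of spam and the fraction of ham that contain it,
then training more ham changes the estimate for tokens that were not in the
new messages at all.  The function below is an assumption-laden toy (it
leaves out the adjustment for rarely seen tokens and the way scores are
combined across tokens), not the actual classifier.

```python
def spam_prob(spam_count, ham_count, nspam, nham):
    """Toy per-token estimate based on ratios of trained messages.

    spam_count / ham_count: trained spam / ham messages containing the token.
    nspam / nham: total trained spam / ham messages.
    """
    spam_ratio = spam_count / nspam if nspam else 0.0
    ham_ratio = ham_count / nham if nham else 0.0
    if spam_ratio + ham_ratio == 0:
        return 0.5  # never seen: no evidence either way
    return spam_ratio / (spam_ratio + ham_ratio)


# A token seen in 5 of 100 spam and 5 of 100 ham looks neutral.
before = spam_prob(5, 5, nspam=100, nham=100)

# Train 100 more ham that do not contain the token: its own counts are
# unchanged, but it now appears in a smaller fraction of ham.
after = spam_prob(5, 5, nspam=100, nham=200)
```

Here `before` is 0.5 and `after` is about 0.67, so the token has drifted
towards spam without ever being trained as spam.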

[RBB]
> I'm not up to learning how to use source; sorry.  I'll 
> wait for the paperback edition.  But it sounds promising.

I think that this is the area where SpamBayes stands to gain most at the
moment.  Two reasons: (1) it seems there are reasonable gains to be made in
accuracy by using a good training regime (whereas it's hard to find similar
gains in the classifier/tokenizer anymore) and (2) this is one of the
biggest usability flaws at the moment, IMO.  We get a lot of "it isn't
classifying properly" emails to which the response is "please try balancing
your database", and it would be great if we could take care of that
automatically.

BTW, one other thing to try is the unigram/bigrams experimental option.  To
get the proper effect you do need to retrain (although you can always save
the current databases and move back to them later).  The idea is
basically that the classifier looks at both individual 'words' and pairs of
words, although it's more complicated than that.  There hasn't been enough
testing done to say for sure that it's always better to use it, but it
certainly looks good in most situations, and many of the developers
(including me and the guy who *is* the stats expert) are using it in our
day-to-day systems.
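
For the basic idea only, here is a minimal sketch of producing both single
words and adjacent pairs from a message.  The function name and the "bi:"
prefix are assumptions for illustration, and the real option does more than
this (as noted above, it's more complicated than just adding pairs).

```python
def unigrams_and_bigrams(words):
    """Return the individual words followed by each adjacent pair of words."""
    words = list(words)
    tokens = list(words)
    # zip(words, words[1:]) pairs each word with the one that follows it.
    tokens.extend(f"bi:{first} {second}"
                  for first, second in zip(words, words[1:]))
    return tokens


tokens = unigrams_and_bigrams(["cheap", "pills", "now"])
# tokens == ["cheap", "pills", "now", "bi:cheap pills", "bi:pills now"]
```

The extra pair tokens are why the database ends up bigger when the option is
switched on.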

If you want to try it out, you'll have to open up your configuration file
(the top of the configuration web page will say where it is) in something
like notepad and add the lines:

   [Classifier]
   x-use_bigrams: True

(without the leading spaces) to the end of the file.

The database will end up bigger, but it should be more accurate and should
also learn more quickly.  This might also help nail the 'spam with a story'
type messages that you described in another post (which I haven't managed to
read properly yet).

(There are a whole slew of other options that could be fiddled about with as
well, but that's the one that offers the most widespread hope).

=Tony Meyer

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.


