[Spambayes] Training Disparity Issues

Tony Meyer tameyer at ihug.co.nz
Mon Jul 19 04:50:57 CEST 2004


> - I make extensive use of Netscape Mail's filters.  SpamBayes 
> is set to add "spam" and "unsure" headers, but not "ham."

Is this through some sort of modification?  By default, SpamBayes will add a
"X-SpamBayes-Classification" header for all messages: ham, spam and unsure.
Or do you mean that you're also adding a notation to the to/subject header,
but only for spam and unsures?

> - I have continued to reduce my ham and spam score cutoffs (currently
> Ham = 0.01, Spam = 0.39),

The spam threshold is *very* low.  If a token hasn't been seen before, it
gets a score of 0.5.  So if you get a message composed entirely of tokens
you haven't seen before, the message will score 0.5 (it's a tad more
complicated than this, but it's a workable lie-to-children).  Since 0.5 is
above your 0.39 cutoff, that message will be classified as spam.  Having the
spam threshold over 0.6 would be a good idea, IMO.
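
To make that concrete, here is a minimal Python sketch of how a score of 0.5
falls against the cutoffs.  This is not SpamBayes code: the classify()
function and the way it treats scores exactly on a cutoff are my own
assumptions, and it uses the same simplification as above (an all-unknown
message scores exactly 0.5).

```python
UNKNOWN_TOKEN_SCORE = 0.5  # score of a token never seen in training


def classify(score, ham_cutoff, spam_cutoff):
    """Hypothetical helper: map a message score to a classification."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


# Cutoffs from the quoted message: an all-unknown message lands in spam.
print(classify(UNKNOWN_TOKEN_SCORE, ham_cutoff=0.01, spam_cutoff=0.39))

# With a spam cutoff of 0.6, the same message is merely unsure.
print(classify(UNKNOWN_TOKEN_SCORE, ham_cutoff=0.01, spam_cutoff=0.6))
```

The first call prints "spam" and the second prints "unsure".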

> but I still get far too many unsures;

Roughly what percentage of your incoming mail would be unsure?  Common
numbers AFAIK are between 2 and 5%, which would be 30-75 messages per day
with 1500 incoming messages.

> 4 - I've made surprisingly few training mistakes (I think), 
> but I don't remember reading how to correct a message incorrectly
> trained, when using the POP3 Proxy.  How do I do this?

If the message is still in the sb_server caches (by default, messages expire
from them after 7 days), you can use the "find message" query on the front
page.  This will bring up the message in a standard review page.  Any
untraining/retraining required based on your selection will be done
automatically.

If the message is no longer in the sb_server caches, then there isn't any
facility for doing this with sb_server.  One of the command-line tools (if
you're running from source) could do this, I presume.  Alternatively, you
can train the message correctly (via the train facility on the front page),
which will partly 'cancel out' the incorrect training (assuming that no
tokenizing options have changed in the meantime).  This is far from ideal,
though, since the incorrect training is still in the database.
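
A toy Python sketch of why training correctly on top of a mistake is not the
same as untraining and retraining.  The ToyCounts class is hypothetical and
only keeps simple per-token counts; it is not the SpamBayes classifier or its
database format.

```python
from collections import Counter


class ToyCounts:
    """Hypothetical per-token ham/spam counts, for illustration only."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.ham = Counter()
        self.spam = Counter()

    def train(self, tokens, is_spam):
        if is_spam:
            self.nspam += 1
            self.spam.update(set(tokens))
        else:
            self.nham += 1
            self.ham.update(set(tokens))

    def untrain(self, tokens, is_spam):
        if is_spam:
            self.nspam -= 1
            self.spam.subtract(set(tokens))
        else:
            self.nham -= 1
            self.ham.subtract(set(tokens))


msg = ["meeting", "tomorrow"]  # really a ham message

# Mistake, then train correctly on top of it.
retrained = ToyCounts()
retrained.train(msg, is_spam=True)
retrained.train(msg, is_spam=False)

# Mistake, then untrain it, then train correctly.
corrected = ToyCounts()
corrected.train(msg, is_spam=True)
corrected.untrain(msg, is_spam=True)
corrected.train(msg, is_spam=False)

print(retrained.nspam, retrained.spam["meeting"])  # spam counts remain
print(corrected.nspam, corrected.spam["meeting"])  # spam counts are gone
```

In the first case the tokens look evenly split between ham and spam; only in
the second do they look like pure ham.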

> This ham-spam disparity has been an occasional topic in this group
> lately.  If roughly equal piles of ham and spam are important for most
> effective classification, it appears to me that it might be useful for
> the program simply to include a weighting factor.

There once was one (the experimental_ham_spam_imbalance option).  It proved
to hurt more than help, and so was deprecated then removed.  If someone can
come up with one that works, then it would certainly get put through the
tests, and added as an experimental option if it does seem to work over many
corpora.

There isn't really enough known about the effects of different training
regimes at the moment.  There's a fair chunk of stuff on the wiki
<http://entrian.com/sbwiki> which you probably should read (or skim), for
starters.

There's at least one training technique (train-to-exhaustion, or tte) that
*forces* a balanced database.  Testing on the different training regimes has
been pretty limited so far, but it looks like tte is as good as, if not
better than, any of the others.  At least one SpamBayes developer uses tte
for SpamBayes training.  The difficulty is that although there's a tte.py
script in the source dist, there isn't really any simple way to do tte with
sb_server at the moment (this will probably arrive, but not soon).  You
could probably rig up some sort of system, but it would be complicated.
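
For a rough idea of what a train-to-exhaustion loop looks like, here is a toy
Python sketch.  It is NOT the tte.py script and does not use the SpamBayes
classifier: ToyClassifier, its naive scoring, the 0.2/0.8 cutoffs, and the way
balance is obtained (offering equal numbers of ham and spam) are all my own
assumptions for illustration.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical scorer: mean per-token spam ratio, 0.5 if unseen."""

    def __init__(self):
        self.ham = Counter()
        self.spam = Counter()

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        probs = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            probs.append(0.5 if s + h == 0 else s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5


def train_to_exhaustion(clf, hams, spams, ham_cutoff=0.2, spam_cutoff=0.8,
                        max_rounds=10):
    """Train only on messages not yet scored correctly, until none remain.

    Returns the number of rounds run (max_rounds bounds the loop, since
    convergence is not guaranteed).
    """
    n = min(len(hams), len(spams))  # offer equal numbers of ham and spam
    pairs = list(zip(hams[:n], spams[:n]))
    for round_no in range(1, max_rounds + 1):
        trained = 0
        for ham, spam in pairs:
            if clf.score(ham) >= ham_cutoff:  # not confidently ham yet
                clf.train(ham, is_spam=False)
                trained += 1
            if clf.score(spam) <= spam_cutoff:  # not confidently spam yet
                clf.train(spam, is_spam=True)
                trained += 1
        if trained == 0:
            return round_no
    return max_rounds


hams = [["meeting", "tomorrow"], ["lunch", "friday"]]
spams = [["cheap", "pills"], ["win", "money"]]
clf = ToyClassifier()
rounds = train_to_exhaustion(clf, hams, spams)
```

With this tiny, non-overlapping data set, everything is trained in the first
round and the second round finds nothing left to train on.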

=Tony Meyer

---
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.


