[Spambayes] Training on unusual ham - revisited
Tony Meyer
tameyer at ihug.co.nz
Sun Feb 12 04:58:22 CET 2006
> The difficulty is that there's no way to prune the database, either to
> adjust the imbalance or to simply decrease the database's size. You
> have to start again from scratch. The Spambayes establishment doesn't
> consider this to be much of an issue, since (as Seth points out)
> Spambayes does a good job of starting from scratch and building an
> acceptable scoring system after seeing surprisingly little data.
I'm not sure that I'd say that it's not considered much of an issue.
The problem is that pruning a database is difficult. As I understand
it, the only safe way to do this is to remove/add entire messages,
rather than individual tokens. However, the SpamBayes database only
keeps track of individual tokens and their ham/spam counts, so we
don't have enough information to remove a set of added tokens, unless
the original message is available. (IIRC, Skip created an enhanced
database at some point that kept enough information to do this; the
code is probably around somewhere.)
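
To illustrate why whole messages are the unit of removal, here is a
minimal Python sketch. It assumes a toy database that stores only
per-token ham/spam counts; the class and method names are hypothetical
and are not taken from the SpamBayes code.

```python
from collections import Counter


class TokenCounts:
    """Toy token database (hypothetical): per-token counts only."""

    def __init__(self):
        self.spam = Counter()   # token -> number of spam messages containing it
        self.ham = Counter()    # token -> number of ham messages containing it
        self.nspam = 0          # number of spam messages trained
        self.nham = 0           # number of ham messages trained

    def learn(self, tokens, is_spam):
        """Add one message's tokens to the counts."""
        counts = self.spam if is_spam else self.ham
        for tok in set(tokens):
            counts[tok] += 1
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, tokens, is_spam):
        """Remove one previously trained message.

        This only works because the caller still has the message's
        tokens; the counts alone do not say which tokens arrived together.
        """
        counts = self.spam if is_spam else self.ham
        for tok in set(tokens):
            counts[tok] -= 1
            if counts[tok] <= 0:
                del counts[tok]
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1


db = TokenCounts()
old_spam = ["cheap", "pills", "now"]
db.learn(old_spam, is_spam=True)
db.learn(["meeting", "now"], is_spam=False)
db.unlearn(old_spam, is_spam=True)   # needs the original token list
```

Deleting an arbitrary token instead would leave the message totals
(nspam/nham) inconsistent with the per-token counts, which is why
message-level removal is the safe operation in this sketch.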
I'm still mostly of the opinion that using some sort of 'train to
exhaustion' regime would work best. This would allow both expiry and
balancing (it essentially does pruning), and still deliver excellent
results. However, it would mean keeping cached mail around for a
while, at least. I just don't have the time at the moment (as the
failure to get 1.1a2 out demonstrates) to implement this for the
Outlook plug-in or sb_server (I did do a partial sb_server
implementation some time ago, but I don't recall how far I got).
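
As a rough illustration of a 'train to exhaustion' loop, the sketch
below rebuilds a filter from a cache of recent mail, training only on
messages that still score on the wrong side of the cutoffs and
repeating until a full pass needs no training. ToyFilter, its scoring
rule, the cutoff values and the function name are all assumptions made
for the example; none of this is the actual sb_server or plug-in code.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a real classifier; not SpamBayes code."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()

    def learn(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Mean spam fraction over tokens seen before; 0.5 if none are known.
        probs = [self.spam[t] / (self.spam[t] + self.ham[t])
                 for t in set(tokens) if self.spam[t] + self.ham[t] > 0]
        return sum(probs) / len(probs) if probs else 0.5


def train_to_exhaustion(cache, ham_cutoff=0.2, spam_cutoff=0.9, max_rounds=10):
    """Rebuild a filter from `cache`, a list of (tokens, is_spam) pairs.

    Returns the filter and the number of passes made over the cache.
    """
    filt = ToyFilter()
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        trained = 0
        for tokens, is_spam in cache:
            s = filt.score(tokens)
            wrong = s < spam_cutoff if is_spam else s > ham_cutoff
            if wrong:
                filt.learn(tokens, is_spam)
                trained += 1
        if trained == 0:   # a clean pass: nothing left to learn
            break
    return filt, rounds


cache = [
    (["cheap", "pills"], True),
    (["meeting", "agenda"], False),
    (["cheap", "offer"], True),
    (["meeting", "lunch"], False),
]
filt, rounds = train_to_exhaustion(cache)
```

Because the filter is rebuilt from whatever is in the cache, dropping
old messages from the cache gives expiry, and only the messages that
are actually needed end up in the database.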
> Another point (I've made it before, but I guess it bears repeating) is
> that the database imbalance is absolutely inherent in the current
> implementation of the Spambayes algorithm, at least in the Outlook
> plugin. Because users set the cutoffs to avoid false positives (you
> have to if the program is going to be useful), virtually all of
> Spambayes's mistakes are false negatives. Since mistakes are all you
> train on after the initial startup, virtually all new entries into
> the database are spam. The better job Spambayes does, the worse the
> imbalance becomes.
Training should be done on all unsure messages, too. When I was
using the Outlook plug-in, I commonly had ham end up as (low scoring)
unsure. That should reduce the imbalance somewhat. Theoretically,
once SpamBayes starts making mistakes, the number of ham-as-unsure
messages would increase, thus helping the balance.
Something that I think would help is not training on every false
negative/spam-as-unsure: something along the lines of training on one,
then rescoring the others to see if they still need training. However,
the plug-in does not make this a simple task, at least at the moment.
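
The 'train one, then rescore the others' idea can be sketched as
follows. This is only an illustration under stated assumptions:
ToyFilter and its scoring rule are hypothetical stand-ins, the cutoff
is arbitrary, and train_sparingly is not a function in the plug-in.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a real classifier; not SpamBayes code."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()

    def learn(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Mean spam fraction over tokens seen before; 0.5 if none are known.
        probs = [self.spam[t] / (self.spam[t] + self.ham[t])
                 for t in set(tokens) if self.spam[t] + self.ham[t] > 0]
        return sum(probs) / len(probs) if probs else 0.5


def train_sparingly(filt, missed_spam, spam_cutoff=0.9):
    """Train on a missed spam only if it still scores below the cutoff.

    Each message is rescored just before the decision, so training on
    one message can make training on similar later ones unnecessary.
    Returns the messages that were actually trained.
    """
    trained = []
    for tokens in missed_spam:
        if filt.score(tokens) < spam_cutoff:
            filt.learn(tokens, is_spam=True)
            trained.append(tokens)
    return trained


filt = ToyFilter()
missed = [
    ["cheap", "pills"],
    ["cheap", "pills", "today"],   # near-duplicate of the first message
    ["casino", "bonus"],
]
trained = train_sparingly(filt, missed)
```

In this toy run the near-duplicate is skipped once the first message
has been trained, so fewer spam entries are added to the database than
there were mistakes.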
> [...] it's a problem that has yet to be solved.
I certainly agree that this is true. ISTM that the 'imbalance'
problem is one that is shared by other filters, as well (cf. the
discussion of the problem in the TREC Spam Track papers). Anyone
know of a good statistician with time to spare? <0.1 wink>
=Tony.Meyer
--
Please always include the list (spambayes at python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.