[Spambayes] Outlook plugin - training

Fri Nov 8 06:46:14 2002

[Moore, Paul]
> ...
> I'm assuming (based on a message I recall seeing recently) that it's
> possible to "correct" training - ie, if I train the classifier that a
> specific message is spam, I can later say "no it isn't, it's ham".

That's right, and at the level of classifier.py it's a two-step process:
unlearn() as spam, then learn() as ham.  It actually doesn't matter which
order those are done in, but I won't admit to that <wink>.
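
A minimal sketch of that two-step correction follows. ToyClassifier is a hypothetical stand-in that only keeps per-token message counts; it is not the classifier.py code, and the learn()/unlearn() signatures shown here are assumptions made for illustration. It also shows why the order does not matter in this toy model: each call touches only its own class's counts.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a token-count classifier (not Spambayes code)."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def _bucket(self, is_spam):
        return self.spam_counts if is_spam else self.ham_counts

    def learn(self, tokens, is_spam):
        # Count each distinct token once per message.
        bucket = self._bucket(is_spam)
        for tok in set(tokens):
            bucket[tok] += 1
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, tokens, is_spam):
        # Exact inverse of learn(); drop entries that fall back to zero.
        bucket = self._bucket(is_spam)
        for tok in set(tokens):
            bucket[tok] -= 1
            if bucket[tok] == 0:
                del bucket[tok]
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1

    def state(self):
        return (self.nspam, self.nham,
                dict(self.spam_counts), dict(self.ham_counts))


def retrain(clf, tokens, was_spam):
    """Flip a message's training from one class to the other."""
    clf.unlearn(tokens, was_spam)
    clf.learn(tokens, not was_spam)


tokens = ["lunch", "friday", "lunch"]

a = ToyClassifier()
a.learn(tokens, True)       # wrongly trained as spam
retrain(a, tokens, True)    # correction: unlearn as spam, learn as ham

b = ToyClassifier()
b.learn(tokens, True)
b.learn(tokens, False)      # opposite order: learn as ham first...
b.unlearn(tokens, True)     # ...then unlearn as spam
```

Both a and b end up holding one ham message and no spam, with identical token counts.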

> Assuming that this is so, is it not reasonable to train dynamically
> on an "assume I got it right" basis?

Depending on context, it *may* be.

> In other words, whenever the addin filters a message as ham or spam,
> automatically train on that basis as well. Then, if the user sees a
> mistake, he corrects it, which automatically retrains the classifier
> (manually deleting as spam or moving a message already does this).

Assuming a conscientious user, and a client that knows enough about what the
user is doing, that should work fine.

> This will keep the database right up to date, and all the user has to
> do is correct any bad decisions the classifier makes (which he should
> be doing anyway).
>
> I've ignored database growth issues, but other than that, is there any
> other problem with this approach?

Doubtless hundreds, but why quibble <wink>.  A misclassified msg will have
bad effects at once if the training gets reflected into the probabilities at
once, so it gets less appealing the less zealous the user is about
correcting mistakes right away.  That can be mitigated by doing the day's
training into a distinct dict, or, with a single dict, by not calling
update_probabilities() until "the end of the day", when the user has
(presumably) corrected all the day's mistakes they're going to correct.  But
if the model updating is going to be delayed anyway, then it makes as much
sense to delay doing any training on "the day's" msgs until the end of the
day.

Determining what "the end of the day" means is a puzzle then too.  For
example, maybe I left my email client running and went on a week-long
vacation.  I'm not going to look over 700 presumed spam when I get back,
I'll just delete it.  But if ham was in there, I've now let it train in the
wrong direction, and that will hurt.
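
One way to picture the "delay all training until the end of the day" variant is the sketch below. Everything in it is hypothetical: DeferredTrainer, its method names, and the RecordingClassifier stub are invented for illustration, and the only thing assumed of a real classifier is some learn(tokens, is_spam) call. Verdicts are parked in a pending dict, corrections just overwrite the parked label, and nothing reaches the classifier until commit() runs.

```python
class RecordingClassifier:
    """Stub that records what it was trained on (illustration only)."""

    def __init__(self):
        self.trained = []

    def learn(self, tokens, is_spam):
        self.trained.append((tuple(tokens), is_spam))


class DeferredTrainer:
    """Hold the day's verdicts back until the user has had a chance to fix them."""

    def __init__(self, classifier):
        self.classifier = classifier
        self.pending = {}  # msg_id -> (tokens, is_spam)

    def record(self, msg_id, tokens, is_spam):
        # Called when the filter files a message; no training happens yet.
        self.pending[msg_id] = (list(tokens), is_spam)

    def correct(self, msg_id, is_spam):
        # The user fixed a mistake before commit: just overwrite the label.
        tokens, _ = self.pending[msg_id]
        self.pending[msg_id] = (tokens, is_spam)

    def commit(self):
        # "End of the day": train on whatever labels survived review.
        for tokens, is_spam in self.pending.values():
            self.classifier.learn(tokens, is_spam)
        count = len(self.pending)
        self.pending.clear()
        return count


clf = RecordingClassifier()
trainer = DeferredTrainer(clf)
trainer.record("m1", ["free", "offer"], True)
trainer.record("m2", ["lunch", "friday"], True)   # filter got this one wrong
trainer.correct("m2", False)                      # user moves it back to ham
untouched_before_commit = (clf.trained == [])
committed = trainer.commit()
```

The sketch says nothing about when commit() should run, which is exactly the "end of the day" puzzle: if it fires on a timer while the user is away, uncorrected mistakes get trained in anyway.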

In other contexts, the scheme doesn't get off the ground.  For example, for
python.org use, nobody is going to review msgs claimed to be spam.  A system
feeding on its own judgments is going to reinforce its own mistakes too, so
the "conscientious, timely, reviewing human" bit is important.
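
The self-reinforcement effect can be seen in a deliberately crude toy. The scoring rule below (mean per-token spam fraction, with unseen tokens at 0.5) is an assumption chosen to keep the arithmetic visible; it is not how Spambayes combines evidence, and all the names are hypothetical.

```python
from collections import Counter

spam_counts = Counter({"offer": 3, "free": 3})
ham_counts = Counter({"offer": 1, "meeting": 1})


def spam_fraction(tok):
    # Fraction of training messages containing tok that were spam.
    s, h = spam_counts[tok], ham_counts[tok]
    return 0.5 if s + h == 0 else s / (s + h)


def score(tokens):
    # Crude score: average spam fraction over the distinct tokens.
    toks = set(tokens)
    return sum(spam_fraction(t) for t in toks) / len(toks)


def self_train(tokens):
    # Train on the filter's own verdict, with no human review.
    is_spam = score(tokens) > 0.5
    bucket = spam_counts if is_spam else ham_counts
    for tok in set(tokens):
        bucket[tok] += 1
    return is_spam


ham_msg = ["offer", "invoice"]   # really ham, but "offer" leans spammy
before = score(ham_msg)          # (0.75 + 0.5) / 2 = 0.625
verdict = self_train(ham_msg)    # judged spam, so trained as spam
after = score(ham_msg)           # (0.8 + 1.0) / 2 = 0.9
```

With nobody reviewing the verdict, the one mistake makes the next similar ham look even more like spam.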