[Spambayes] Training

Tue Nov 19 22:39:37 2002

[Mark Hammond]
> ...
> My concern is almost identical though - the *next* email that looks the
> same.  Let's say I subscribe to a weekly newsletter.  This week's comes in,
> gets marked as unsure, so I train.  Next week's comes in - again, it trains
> as unsure.  Repeat ad nauseam.
>
> I saw this a real lot when I had a high ham:spam imbalance -
> training had no obvious effect.

Complicating this, though, there were glitches in the Outlook client back then
that prevented retraining and/or rescoring from working as intended.

> I am still hoping to try Tim's new adjustment,

Note that it's already enabled in the Outlook client (but not in the general
codebase yet) -- the first time you do anything that recomputes the
probabilities, it will kick in with full force.

That's actually going to make the described problem worse:  when you have a
lot more ham than spam, the effect of the adjustment is to make everything
"less hammy" than it was.  This should help a lot when training on spam, but
makes training on ham *less* effective than it was.  In effect, it's saying
that training on new ham is much less valuable than training on new spam,
because you already have way more of the former.
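
To make the direction of that effect concrete, here is a minimal sketch.
It is not the code in the Outlook client: word_prob, its parameters, and
the s/x defaults are assumptions made up for illustration.  The idea
shown is to discount the evidence count from whichever class is
over-represented before blending the raw spam ratio with a neutral
prior.

```python
def word_prob(hamcount, spamcount, nham, nspam, adjust=False, s=0.45, x=0.5):
    """Spam probability for one word (hypothetical sketch).

    hamcount/spamcount: messages of each class containing the word.
    nham/nspam: total messages trained per class (both must be > 0).
    s, x: assumed prior strength and prior guess for unseen words.
    """
    if nham <= 0 or nspam <= 0:
        raise ValueError("need at least one trained ham and one trained spam")
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    if hamratio + spamratio == 0:
        return x  # never seen: fall back to the prior guess
    p = spamratio / (hamratio + spamratio)
    if adjust:
        # Assumed form of the adjustment: evidence from the larger class
        # is scaled down by the class-size ratio, capped at 1.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
        n = hamcount * spam2ham + spamcount * ham2spam
    else:
        n = hamcount + spamcount
    # Less evidence (smaller n) pulls the result back toward x.
    return (s * x + n * p) / (s + n)
```

With 1000 ham and 100 spam trained, a word seen in 10 ham and no spam
scores closer to 0.5 with adjust=True than without, while a word seen
only in spam scores the same either way.  That is the "less hammy"
effect under this assumed form.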

> but I wonder if somehow similar maths could be exploited.  For example,
> manually training a message could be seen as "intense training", whereas a
> normal train is - well - normal.  The point of manual training is that the
> system got it wrong, and the user wants to see the error stop.  "normal"
> training is just giving the system fairly "general" instructions.

You could feed a msg into training more than once as ham (or spam).  The
classifier doesn't know the difference between training on a single msg N
times, and training on N different msgs.  We could even feed the msg in, in
a loop, until the score went out of Unsure territory.  That would be
novel -- picture the effects on the system if I were to do this with my
Nigerian-scam quote.  Brrr!
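
A rough sketch of that loop, using a toy classifier rather than the real
one.  ToyClassifier, train_until_sure, the averaging score, the 0.2/0.9
cutoffs and the max_rounds cap are all assumptions for illustration;
only the idea of retraining until the score leaves Unsure comes from the
paragraph above.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical toy, not the SpamBayes classifier."""

    def __init__(self, s=0.45, x=0.5):
        self.s, self.x = s, x  # assumed prior strength and prior guess
        self.nham = self.nspam = 0
        self.ham = Counter()
        self.spam = Counter()

    def train(self, words, is_spam):
        # Every call counts as one more message: training on one msg
        # N times is indistinguishable from N identical msgs.
        if is_spam:
            self.nspam += 1
            self.spam.update(set(words))
        else:
            self.nham += 1
            self.ham.update(set(words))

    def word_prob(self, w):
        n = self.ham[w] + self.spam[w]
        if n == 0:
            return self.x
        hamratio = self.ham[w] / self.nham if self.nham else 0.0
        spamratio = self.spam[w] / self.nspam if self.nspam else 0.0
        p = spamratio / (hamratio + spamratio)
        return (self.s * self.x + n * p) / (self.s + n)

    def score(self, words):
        # Plain average of word probabilities: an assumed simplification.
        probs = [self.word_prob(w) for w in set(words)]
        return sum(probs) / len(probs) if probs else self.x


def train_until_sure(clf, words, is_spam,
                     ham_cutoff=0.2, spam_cutoff=0.9, max_rounds=50):
    """Retrain on one message until it scores outside the Unsure band."""
    rounds = 0
    while ham_cutoff <= clf.score(words) <= spam_cutoff and rounds < max_rounds:
        clf.train(words, is_spam)
        rounds += 1
    return rounds
```

The max_rounds cap matters: in this toy, a message that shares enough
words with trained spam can level off inside the Unsure band no matter
how many times it is fed in as ham, so the loop needs a way out.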

But no matter how we cut this, so long as there's more of one kind of data
than the other, the class with the lesser amount of data is the one that
limits potential accuracy.

> The only reason I mention this is because last time I mentioned something
> that demonstrated my ignorance, Tim promptly replied confirming it, then
> subsequently made the change anyway <wink>.

Familiar patterns are such a comfort to us all <wink>.