[Spambayes] Outlook plugin - training

Tim Peters tim.one@comcast.net
Fri Nov 8 07:20:18 2002


[Mark Hammond]
> ...
> The key limitation of this scheme, as Tim also alludes to, is that this
> never correctly classifies ham.  However, I actually see this
> incremental training more as a "get smarter now" than a "just get
> smarter" technique - ie, a user sees a mis-classified Spam, by re-
> training they are increasing the chances that the next similar mail
> will be handled correctly.  Instant feedback, especially while the user
> is getting started.
>
> ie, it is indeed "mistake based training", but that may still prove
> useful in addition to ongoing training.

I sure agree it's *very* useful at the start, and expect it will continue to
be useful over time.

> I can't help thinking that we are somehow underestimating our own
> tool here.

I'm going to try an experiment:  I'm going to wipe my home database and
start over from scratch, training first on one ham and one spam, then only
on mistakes and unsures.  This should be fun <wink>.
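
As a rough sketch of that experiment (not the actual Spambayes code): the
`ToyClassifier` class, the `train_on_mistakes` helper, and the 0.2/0.9
cutoffs below are all assumptions made up for illustration, and the scoring
is a crude naive-Bayes-style estimate, not the combining scheme Spambayes
uses.

```python
import math
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real classifier (not the Spambayes API)."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.nmsgs = {"ham": 0, "spam": 0}

    def train(self, text, is_spam):
        label = "spam" if is_spam else "ham"
        self.nmsgs[label] += 1
        # Count each distinct word once per message.
        self.counts[label].update(set(text.lower().split()))

    def score(self, text):
        """Return a spam score in [0, 1] from Laplace-smoothed word odds."""
        logodds = 0.0
        for word in set(text.lower().split()):
            p_spam = (self.counts["spam"][word] + 1) / (self.nmsgs["spam"] + 2)
            p_ham = (self.counts["ham"][word] + 1) / (self.nmsgs["ham"] + 2)
            logodds += math.log(p_spam / p_ham)
        logodds = max(-50.0, min(50.0, logodds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-logodds))


def train_on_mistakes(clf, stream, ham_cutoff=0.2, spam_cutoff=0.9):
    """Train only on messages scored wrongly or left in the unsure band.

    `stream` yields (text, is_spam) pairs; returns how many were trained on.
    """
    trained = 0
    for text, is_spam in stream:
        s = clf.score(text)
        correct = s >= spam_cutoff if is_spam else s <= ham_cutoff
        if not correct:
            clf.train(text, is_spam)
            trained += 1
    return trained
```

Seeding means calling `train` once with a ham and once with a spam before
handing the rest of the mail to `train_on_mistakes`.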

> As is common when people first use this tool, spam is generally
> found in the ham set and vice-versa.  Because of this, I know that my
> Inbox is spam free (but less sure about my other "ham" folders).  I'm
> also sure that my Spam folder has no ham.  This should remain true
> while I continue to use the tool.

How do you know your Spam folder has no ham?  I know mine doesn't because I
routinely score it, sort on the score, and stare at "the wrong end".  I find
ham there as often as not, *usually* apparently due to mousing error when
dragging a training ham into the Ham folder and overshooting the mark.
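
The "score it, sort it, look at the wrong end" check might be sketched like
this. The `wrong_end` function and its `score` callable are hypothetical
names, not part of the Outlook plugin.

```python
def wrong_end(messages, score, folder_is_spam, n=10):
    """Return the n messages whose scores most disagree with their folder.

    For a Spam folder the suspicious end is the lowest scores; for a Ham
    folder it is the highest scores.
    """
    ranked = sorted(messages, key=score, reverse=not folder_is_spam)
    return ranked[:n]
```

Any ham dropped into the Spam folder by mistake should show up at the front
of the returned list, provided the classifier is otherwise well trained.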

> So surely we can exploit this somehow.  Off the top of my head:
> * Assume we don't trust the last 2 days of mail (as the user may not
> yet have sorted them).  Anything in the "good" and "spam" folders older
> than this can be assumed correctly classified, and able to be trained
> on.

Provided the user has already done a decent amount of training, then as Paul
Moore suggested it could even work to trust ham-vs-spam decisions
immediately, and let user corrections undo those as needed.  A well-trained
system should be pretty robust against a few misclassifications over the
short term.
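
A minimal sketch of the "don't trust recent mail" rule, assuming each message
is a dict with a hypothetical `received` timestamp. Setting `grace` to zero
gives the trust-immediately variant.

```python
from datetime import datetime, timedelta


def trainable(messages, now, grace=timedelta(days=2)):
    """Return the messages old enough that the user has probably sorted them."""
    return [m for m in messages if now - m["received"] >= grace]
```

With `grace=timedelta(0)` every message qualifies at once, and the system
relies on later user corrections to undo any bad training.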

> * A process could go through all ham and spam trained on, and score each
> message.  Any "suspect" messages are presented in a list (much like the
> Outlook "Find Message" result list).  The user can indicate that the
> message is correct (and the system will remember, never asking about
> this message again) or is indeed incorrectly classified.  If incorrect,
> it will be moved, and incrementally trained as per now.  (I can also
> picture a whitelist kicking in here; if incorrect, offer to add the
> sender to a whitelist.  If a sender is in the whitelist, assume ham,
> meaning mail from this person can never again be classified as spam)

Tell us about the mistakes *you* see.  I feel like we're designing a
solution to a hypothetical problem otherwise.  The only "mistake" I
routinely see is that my cigarettes-via-web advertising keeps getting
knocked back into Unsure territory.  That doesn't bother me enough to do
anything about it, but if it bothers you enough <wink> then, yes, a
whitelist would solve that one.

> I can picture this working in the background, and simply indicating to
> the user that there are "conflicts" to be resolved at their leisure.

Or maybe we could just move those back to the Unsure folder.  The user
should already know what to do about things in Unsure, so it's nothing new
to them.  Moving a msg out of Unsure could be taken as a positive sign that
the user has classified such a msg once and for all (well, until they move
it again, anyway).
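
One way the background check could look, as a sketch only: `find_conflicts`,
`user_moved_out_of_unsure`, the `confirmed` set, and the 0.2/0.9 cutoffs are
all assumptions for illustration. Messages it returns would be moved back to
Unsure; moving one out again records the user's decision.

```python
def find_conflicts(trained, score, confirmed, ham_cutoff=0.2, spam_cutoff=0.9):
    """Return ids of trained messages whose score disagrees with their label.

    `trained` yields (msg_id, is_spam) pairs, `score` maps an id to a spam
    score in [0, 1], and `confirmed` holds ids the user has already settled.
    """
    conflicts = []
    for msg_id, is_spam in trained:
        if msg_id in confirmed:
            continue  # the user classified this one by hand; don't ask again
        s = score(msg_id)
        agrees = s >= spam_cutoff if is_spam else s <= ham_cutoff
        if not agrees:
            conflicts.append(msg_id)
    return conflicts


def user_moved_out_of_unsure(msg_id, confirmed):
    """Record that the user has classified this message once and for all."""
    confirmed.add(msg_id)
```

If the user later moves the message again, the id would simply be
re-confirmed under its new folder.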

> Further, I imagine that as we build better training data for each
> message store, the number of "conflicts" actually found would
> generally be zero - ie, the system would find that all 2 day and
> older mail correctly classifies.

I expect that's true.



