[Spambayes] Outlook plugin - training
Mark Hammond
mhammond@skippinet.com.au
Wed Nov 6 22:09:04 2002
[Piers responding to Paul]
> I don't believe you need this. I think that the classifier automatically
> trains on messages as they arrive (or at least on messages that it's
> sure about). You only need to retrain if it has made a mistake, or if
> it's unsure.
As Tim says, we really only do "mistake" training - nothing is trained as it
comes in, only scored. Manually moving messages (via the button or d&d) is
the only thing that triggers an incremental re-train.
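
A minimal sketch of that flow, assuming a toy word-count scorer in place of the real classifier. ToyClassifier, on_arrival and on_user_move are hypothetical names for illustration, not the plugin's API:

```python
from collections import Counter


class ToyClassifier:
    """Toy word-count scorer; a stand-in, not the Spambayes classifier."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, text, is_spam):
        # Count each distinct word once per message.
        target = self.spam if is_spam else self.ham
        target.update(set(text.lower().split()))

    def score(self, text):
        """Return a spam score in [0, 1]; 0.5 means no evidence either way."""
        s = h = 0
        for word in set(text.lower().split()):
            s += self.spam[word]
            h += self.ham[word]
        return 0.5 if s + h == 0 else s / (s + h)


def on_arrival(clf, text):
    # Incoming mail is only scored; nothing is trained here.
    return clf.score(text)


def on_user_move(clf, text, moved_to_spam):
    # A manual move (button or drag and drop) is the only event that trains.
    clf.train(text, moved_to_spam)
```

Scoring a message never changes the counts, so the classifier only gets smarter when the user corrects it.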
The key limitation of this scheme, as Tim also alludes to, is that it
never trains on correctly classified ham. However, I actually see this incremental
training more as a "get smarter now" than a "just get smarter" technique -
ie, when a user sees a mis-classified spam and re-trains on it, they increase
the chances that the next similar mail will be handled correctly. Instant
feedback, especially while the user is getting started.
ie, it is indeed "mistake based training", but that may still prove useful
in addition to ongoing training.
I can't help thinking that we are somehow underestimating our own tool here.
As is common when people first use this tool, spam is generally found in the
ham set and vice-versa. Because of this, I know that my Inbox is spam free
(but less sure about my other "ham" folders). I'm also sure that my Spam
folder has no ham. This should remain true while I continue to use the
tool.
So surely we can exploit this somehow. Off the top of my head:
* Assume we don't trust the last 2 days of mail (as the user may not yet
have sorted them). Anything in the "good" and "spam" folders older than
this can be assumed correctly classified, and able to be trained on.
* A process could go through all ham and spam trained on, and score each
message. Any "suspect" messages are presented in a list (much like the
Outlook "Find Message" result list). The user can indicate that the message
is correct (and the system will remember, never asking about this message
again) or is indeed incorrectly classified. If incorrect, it will be moved,
and incrementally trained as per now. (I can also picture a whitelist
kicking in here; if a ham message was incorrectly classified, offer to add
its sender to the whitelist. If a sender is in the whitelist, assume ham,
meaning mail from this person can never again be spam.)
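
The two ideas above can be sketched roughly as follows. This is only an illustration: the Message record, the function names, and the cutoff values are assumptions made for the example, and score stands for any callable that returns a spam probability for a message's text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Message:
    msg_id: str
    folder: str          # "ham" or "spam": where the user has left it
    received: datetime
    text: str


def trainable(messages, now, grace=timedelta(days=2)):
    """Messages old enough that the user has presumably sorted them."""
    return [m for m in messages if now - m.received >= grace]


def find_suspects(messages, score, confirmed, ham_cutoff=0.2, spam_cutoff=0.9):
    """Messages whose score disagrees with the folder they sit in.

    score: callable mapping text to a spam score in [0, 1] (hypothetical).
    confirmed: set of msg_ids the user has already vouched for.
    The cutoffs are illustrative placeholders.
    """
    suspects = []
    for m in messages:
        if m.msg_id in confirmed:
            continue  # the user said it is correct; never ask again
        s = score(m.text)
        looks_wrong_ham = m.folder == "ham" and s > ham_cutoff
        looks_wrong_spam = m.folder == "spam" and s < spam_cutoff
        if looks_wrong_ham or looks_wrong_spam:
            suspects.append(m)
    return suspects
```

A background pass would call trainable() to pick the settled mail, train on it, then show whatever find_suspects() returns as the list of conflicts. Confirming a message adds its id to confirmed; rejecting it means moving it and training incrementally as now.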
I can picture this working in the background, and simply indicating to the
user that there are "conflicts" to be resolved at their leisure. Further, I
imagine that as we build better training data for each message store, the
number of "conflicts" actually found would generally be zero - ie, the
system would find that all mail 2 days old or older classifies correctly.
While the above is more a brain-fart than a reasoned design, I agree that
staying out of your face is important for widespread use.
Mark.