[Spambayes] Outlook plugin - training

Wed Nov 6 22:38:46 2002

Tim Peters <tim.one@comcast.net> writes:

> [Moore, Paul, on the Outlook2K client]
>> Actually, I'm not sure I like "Potential Spam" being treated as
>> spam until confirmed as OK.
>
> It doesn't.  "Potential Spam" really means "unsure" -- it would be
> as accurate to call it "Potential Ham", but neither is as accurate
> as Unsure.  The system knows it doesn't know what to call msgs in
> this category, and the client doesn't automatically train on Unsure
> msgs (unless you *manually* drag one into your Spam folder, or into
> one of your Ham folders).

That sounds like the best option. But it makes me wonder - what is a
"Spam" folder, and what is a "Ham" folder, in this context? My best
guess is that we're looking at the folders defined in the training
dialog. I'm having difficulty following the addin code, but that feels
logical (I've never seen an Outlook addin before, so I'm struggling
with "lots of code, can't see the flow" problems ATM...)
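
To make the three-way split concrete, here is a minimal sketch of how a
score could map onto ham / unsure / spam. The cutoff values and the
function name are my own assumptions for illustration, not taken from
the addin code:

```python
# Hypothetical cutoffs, chosen only for illustration; the addin's real
# settings may differ.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def categorize(score):
    """Map a spam probability in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if score < HAM_CUTOFF:
        return "ham"
    if score >= SPAM_CUTOFF:
        return "spam"
    # Anything between the two cutoffs is "unsure": the system knows it
    # doesn't know, so nothing should be trained on it automatically.
    return "unsure"
```

On this reading, "Potential Spam" is just the middle band, and a message
there only becomes training data once someone drags it into a Spam or
Ham folder.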

>> I have Rules Wizard rules which sort E-Mail traffic out into
>> folders.  I'm entirely happy with the behaviour I understand to be
>> the case - rules processed before the plugin - as I don't get spam
>> on list addresses, so I'm OK with list traffic being totally
>> excluded from the spam process.
>
> The Define Filters dialog has a multi-selection folder control, so
> you can tell the client to watch any number of folders (you're not
> limited to the Inbox alone; add the destination folders of your
> other Outlook rules if you want email coming into those watched
> too).

I'm not entirely sure I do want that. As I said, anything moved by the rules
wizard is list traffic, and as such is (a) non-spam (so no need to
check it) and (b) not at all typical of personal mail. My intuition
says that including list traffic will tend to dilute the clues which
distinguish personal mail and spam. Of course, I know that the
classifier *really* works by magic, and so my intuition is useless :-)

> The interaction with Outlook's Rules Wizard (RW) remains unclear.
> The RW's internal workings appear undocumented, and there appears no
> way to hook into it.  I've definitely seen the addin's filtering
> rules trigger *while* the RW was still running, and in some cases
> that can lead to the addin's filtering looking at a msg more than
> once.  For example, the addin's filter may trigger when a msg first
> arrives in the Inbox, and then a second time on the same msg when
> the RW moves it into a different folder that the addin's filter is
> also watching.  In this case the client suffers an internal
> exception, as the entry ID Outlook told it to use for the first
> trigger gets invalidated by the move.  It works OK in the end, but
> "something isn't quite right" about it.

Ooh, that's even worse than I thought (and also entirely consistent
with what I've come to expect from Outlook :-()
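
For what it's worth, the double-trigger case described above could be
tolerated along these lines. This is only a sketch under my own
assumptions: FakeStore, StaleEntryError and the "message_id" key are
hypothetical stand-ins, not Outlook's or the addin's real API.

```python
class StaleEntryError(Exception):
    """Hypothetical: raised when an entry ID no longer resolves."""


class FakeStore:
    """Toy message store in which moving a message changes its entry ID."""

    def __init__(self):
        self._by_id = {}

    def add(self, entry_id, msg):
        self._by_id[entry_id] = msg

    def move(self, old_id, new_id):
        # The old entry ID becomes invalid once the message is moved.
        self._by_id[new_id] = self._by_id.pop(old_id)

    def get(self, entry_id):
        try:
            return self._by_id[entry_id]
        except KeyError:
            raise StaleEntryError(entry_id) from None


def process_events(store, entry_ids, score_message):
    """Score each message once, skipping stale IDs and repeat triggers."""
    seen = set()
    results = []
    for entry_id in entry_ids:
        try:
            msg = store.get(entry_id)
        except StaleEntryError:
            continue  # ID invalidated by a move; a later event covers it
        key = msg["message_id"]  # assumed to survive the move
        if key in seen:
            continue  # same message triggered again from another folder
        seen.add(key)
        results.append((key, score_message(msg)))
    return results
```

Whether anything like a stable per-message key is actually available
from Outlook is exactly the sort of thing I can't tell from the code yet.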

>> I think I may switch off the "potential spam" bit, and just filter
>> out known spam, and classify my Inbox by hand. I'll leave it a bit
>> longer before deciding, though.
>
> You'll be happier if you keep an Unsure folder.  For me, about 1% of
> my email ends up there, about half-and-half ham vs spam, and my
> Inbox is virtually spam-free (while my Spam folder is pure spam now
> -- about 100 per day).

You could easily be right on this. It's not so much that I don't want
an Unsure folder, as that I don't know how best to manage it. My
instinctive reaction is that I want "Spam" and "Not Spam" buttons, and
then I read or delete the message in situ. Using the act of moving the
message to indicate the status feels wrong.

But maybe, in the light of what you said above (about watching
multiple folders), I need to rethink this - for "normal mail" folders
at least, if not for list traffic.

OK, I'll try thinking in terms of 4 categories of folder - ham, spam,
unsure, and "list traffic". In real terms, "list traffic" is no
different from unsure, except that the addin will never put
mail into the "list traffic" folders. I think that fits what I'm
after, and doesn't stray too far from the "expected model". I even
think that (if it works) I can write the logic up well enough to serve
as the basis for some documentation :-)
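
As a first stab at writing that logic down, here is a small sketch of
the four-category model. The names are hypothetical, and the assumption
that ham is simply left where it arrived (rather than filed somewhere by
the addin) is mine:

```python
from enum import Enum


class FolderKind(Enum):
    HAM = "ham"
    SPAM = "spam"
    UNSURE = "unsure"
    LIST = "list traffic"


def trains_as(kind):
    """Label learned when a message is manually moved into such a folder.

    Returns None for folders that carry no training signal.
    """
    if kind is FolderKind.HAM:
        return "ham"
    if kind is FolderKind.SPAM:
        return "spam"
    return None  # unsure and list traffic teach the classifier nothing


def addin_may_file_into(kind):
    """True if the addin itself may move mail into a folder of this kind.

    Assumption: ham stays where it arrived, so only spam and unsure
    folders are filter destinations; list traffic never is.
    """
    return kind in (FolderKind.SPAM, FolderKind.UNSURE)
```

So "list traffic" and "unsure" agree on trains_as() and differ only on
addin_may_file_into(), which is the distinction I was after.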

> Another:  Note that this is pre-alpha software, and you should definitely
> keep persistent Ham and Spam folders for training, as updating the code may
> invalidate your database(s), or introduce tokenization and/or scoring and/or
> configuration changes that render your database(s) worse than useless.  IOW,
> you should stay prepared to retrain from scratch.  I set up a distinct .pst
> file to hold Ham and Spam examples for this purpose, to keep from cluttering
> my primary msg store.  The folder controls in the addin (unlike several in
> Outlook itself!) allow selecting multiple folders from multiple msg stores
> too, and my Spam folder is actually in this other .pst file.

Oh, I agree. I'm keeping spam now, so that I have a good training set
of spam. I already keep loads of ham, so I don't feel the need to keep
any more. But I do delete a particular *type* of message - the
one-liners from Accommodation Services about cars with their lights
left on, fire alarm tests, and the like. I'd rather not bother
retaining these - just read and hit the delete button. OK, maybe I
could code up a "move to ham archive" button which I could put next to
the delete button. Maybe that's worth doing. It's back to that "how
does the classifier know?" question again :-)

Paul.

-- 
This signature intentionally left blank