[Spambayes] Some more experiences with the Outlook plugin

Mark Hammond mhammond@skippinet.com.au
Tue Nov 12 04:44:11 2002


> I've now had the Outlook plugin running for about a week, and I'm
> starting to get a feel for using it. The following is my "user
> interface" experience. It's a slightly unrealistic combination of
> "what I actually did" and "what I realised afterwards I should have
> done", but it is what I would use as notes telling a new user how
> to set the system up, and as such it picks up on a few interesting
> issues:
>
> 1. To start with, configure the plugin to define one "Spam" folder and
>    one "Unsure" folder, and define all other folders as "Ham". [1]

Tim gives a great explanation of why this is not really possible - some
people simply have too much ham, and even for those who don't, the relative
ratios are important.

> * Following on from this, I also see Tim's behaviour of surprising
>   unsure cases (or worse, false negatives!). Worst case recently was a
>   message which scored as solid ham. I trained on it as "Spam", and
>   rescored it. It still scored 5 - solid ham. My immediate reaction was
>   "But I just *told* you it's spam!". I know that isn't how the classifier
>   works, but even so it was unsettling. FWIW, I attach the spam clues for
>   this one (I don't know if they make any sense in isolation, but it can't
>   hurt...)

This too was my experience.  For a while I trained over a huge ham corpus
while my spam corpus was still under 1000 messages - around 15:1 ham:spam.
I too trained on new ham and spam, and was disappointed to see the score
remain almost identical.

Re-training on just my inbox - roughly 3:1 ham:spam - yields far, far
better results.

Tim's idea of:
> In the list you gave below, there are very few hapaxes (I recognize
> them from the probabilities; I should probably add code to the client
> to display the raw counts too):

certainly would be useful.
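
Something like the following is what I imagine - purely a sketch, since I'm
guessing at the classifier's internals here: the 'wordinfo' mapping, the
'spamcount'/'hamcount' attributes and the 'probability()' helper are all
assumed names, not necessarily what the real store exposes:

    # Sketch only: 'bayes.wordinfo', 'record.spamcount', 'record.hamcount'
    # and 'bayes.probability(record)' are guesses at the classifier's
    # internals - substitute whatever the real store actually exposes.
    def show_clues_with_counts(bayes, tokens):
        for token in sorted(set(tokens)):
            record = bayes.wordinfo.get(token)
            if record is None:
                continue  # token never seen in training
            prob = bayes.probability(record)
            print("%-25s prob=%.3f  spam=%4d  ham=%4d"
                  % (token, prob, record.spamcount, record.hamcount))

Seeing the raw counts next to the probability would make it obvious at a
glance whether a strong clue rests on one stray training message or on many.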

Without the maths background, I find it interesting to speculate, however
ignorantly, on these ratios.  Tim's analysis:
> '(and' is nearly "33 times closer" to 0 than '"remove"' is to 1,
> and that makes the accidental appearance of a ham word in spam much
> more powerful than the systematic appearance of a spam word in spam.

makes me wonder why the classifier can't exploit the ham:spam ratio to give
weighted results.  Or from another POV, what would happen if we artificially
boosted the ratio by training on each spam multiple times?
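
To make that concrete, here is a toy version of the experiment.  The
word_prob() function is only my reading of a Robinson-style smoothed
per-word probability - the S/X prior values and all the counts are invented
for the example, and this is not the plugin's actual code:

    # Toy model, not the real spambayes code.  word_prob() is my reading
    # of a Robinson-style smoothed per-word spam probability; S and X are
    # invented prior-strength / prior-probability values.
    S, X = 0.45, 0.5

    def word_prob(spamcount, hamcount, nspam, nham):
        spamratio = spamcount / nspam if nspam else 0.0
        hamratio = hamcount / nham if nham else 0.0
        if spamratio + hamratio == 0.0:
            return X                             # never seen: use the prior
        p = spamratio / (spamratio + hamratio)   # raw per-word estimate
        n = spamcount + hamcount                 # evidence seen so far
        return (S * X + n * p) / (S + n)         # shrink toward the prior

    # '"remove"' seen in 3 of 60 spam and 1 of 900 ham; now train each
    # spam 'boost' times, which multiplies the spam counts uniformly.
    for boost in (1, 2, 5, 10):
        print("boost=%2d  prob=%.3f"
              % (boost, word_prob(3 * boost, 1, 60 * boost, 900)))

On these made-up numbers the probability climbs from 0.930 to 0.971 as the
boost grows: the raw ratio never changes, but the extra counts strengthen
the word's evidence against the 0.5 prior.  Which at least suggests the
experiment would do *something*.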

I speculate this way because of my experience with these large ratios, and
because *every* one of these mails came through my Inbox.  Much of my ham is
python.org mailman traffic - thus, the *true* ham:spam ratio through my mail
account is much higher than the ratio left once the mailing list traffic is
removed.  Even though the total spam is the same, the system will score
better or worse depending on the amount of ham I throw at it.  It isn't
intuitive to me why this need be so.
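
As an illustration, using the same toy formula as the sketch above (again,
only my reading of the smoothing, with invented numbers): hold the spam side
completely fixed and grow only the size of the ham corpus.

    # Same toy formula as the earlier sketch; all counts are invented.
    def word_prob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
        spamratio = spamcount / nspam if nspam else 0.0
        hamratio = hamcount / nham if nham else 0.0
        if spamratio + hamratio == 0.0:
            return x
        p = spamratio / (spamratio + hamratio)
        n = spamcount + hamcount
        return (s * x + n * p) / (s + n)

    # A hammy word: 40 hits in ham, 1 accidental hit in spam.  The spam
    # side is fixed at 300 messages; only the total ham count grows (say,
    # by adding mailing list traffic that never uses the word).
    for nham in (900, 4500, 13500):       # 3:1, 15:1 and 45:1 ham:spam
        print("nham=%5d  prob=%.3f" % (nham, word_prob(1, 40, 300, nham)))

Because ham counts are normalised by the total ham trained, burying the
classifier in unrelated ham dilutes each individual ham clue: the word
drifts from a strong ham indicator (0.074) to nearly neutral (0.529) without
a single extra spam being trained.  If that is roughly what the real code
does, it would explain why the sheer amount of ham matters.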

Mark.



