[Spambayes] Interesting behaviour from the Outlook client

Tim Peters tim.one at comcast.net
Wed Dec 4 17:10:48 2002


[Moore, Paul]
> Over the past few days, I've been seeing an increase in FNs and
> Unsures. I initially trained on my inbox and spam folders (386
> ham, 999 spam), and since then I've trained on errors only. I'm
> now at 391 ham and 1011 spam. Initially, I was getting no errors,
> and 1 or 2 unsures per day. Now, I'm starting to get at least 1
> FN per day, and a slight increase in the unsure rate.

My experiments with mistake-based training all said it was brittle, due to
extreme reliance on hapaxes.  That makes it more of a keyword-spotting
classifier than a statistical inferencer.  But since you've trained on only
5 ham + 12 spam since starting mistake-based training, I think this is just
evidence that spam is changing.
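To make the hapax point concrete, here is a minimal sketch (not spambayes code) that measures what fraction of the distinct tokens in a training set were seen in exactly one message. The function name `hapax_fraction` and the toy messages are made up for illustration; a real check would run over the tokens your tokenizer actually produces.

```python
from collections import Counter


def hapax_fraction(training_msgs):
    """Fraction of distinct tokens that appear in exactly one training message.

    training_msgs: iterable of token lists, one list per trained message.
    (Hypothetical helper for illustration only.)
    """
    msg_counts = Counter()
    for tokens in training_msgs:
        # Count each token at most once per message.
        msg_counts.update(set(tokens))
    if not msg_counts:
        return 0.0
    hapaxes = sum(1 for count in msg_counts.values() if count == 1)
    return hapaxes / len(msg_counts)


# Toy data: "pills" and "agenda" each occur in only one message.
msgs = [
    ["cheap", "pills", "now"],
    ["meeting", "now"],
    ["cheap", "meeting", "agenda"],
]
print(hapax_fraction(msgs))  # 2 hapaxes out of 5 distinct tokens -> 0.4
```

The higher this fraction, the more a score depends on tokens seen only once, which is the keyword-spotting behaviour described above.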

> It's far too early to tell, but could this be related to Tim's
> code to handle unbalanced training sets? As time goes on, the
> spam:ham ratio will increase (as FNs happen more often than FPs)
> and so the impact of spam clues will be lessened (by Tim's code).

This is so, and an increase in FN is an expected outcome of the imbalance
adjustment, if you have more spam than ham.  If you want to experiment with
life without the imbalance adjustment, comment out the

experimental_ham_spam_imbalance_adjustment: True

line in your default_bayes_customize.ini file (in your spambayes Outlook2000
directory).  That will make everything look more spammy, so an increase in
FP is an expected outcome if you do this.
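A rough sketch of how an adjustment of this kind can weaken spam clues when there is more spam than ham. This is an illustration under stated assumptions, not a copy of the spambayes source: the function name, the weighting scheme (scaling a clue's evidence count by the training ratio, capped at 1), and the prior strength `s` and prior probability `x` are all assumed here.

```python
def word_spamprob(spamcount, hamcount, nspam, nham,
                  adjust=True, s=0.45, x=0.5):
    """Estimate a token's spam probability with an optional imbalance adjustment.

    spamcount/hamcount: trained spam/ham messages containing the token.
    nspam/nham: total trained spam/ham messages.
    s, x: assumed prior strength and prior probability for unseen evidence.
    """
    if spamcount + hamcount == 0:
        return x  # no evidence at all: fall back on the prior
    nspam = float(nspam or 1)
    nham = float(nham or 1)
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    prob = spamratio / (hamratio + spamratio)
    if adjust:
        # Discount evidence from whichever class is over-represented.
        spam_weight = min(nham / nspam, 1.0)
        ham_weight = min(nspam / nham, 1.0)
    else:
        spam_weight = ham_weight = 1.0
    n = spamcount * spam_weight + hamcount * ham_weight
    # Less effective evidence (smaller n) pulls the estimate toward x.
    return (s * x + n * prob) / (s + n)


# A token seen in 3 spam and 0 ham, with 1011 spam and 391 ham trained.
with_adjustment = word_spamprob(3, 0, 1011, 391, adjust=True)
without_adjustment = word_spamprob(3, 0, 1011, 391, adjust=False)
print(with_adjustment, without_adjustment)
```

With these training counts the adjusted value comes out lower than the unadjusted one, i.e. the same spam clue counts for less, which is the direction of the effect discussed above.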

> I'll keep monitoring this, but my "real life" mail is definitely
> unbalanced (home is massively biased in favour of spam, work
> massively biased in favour of ham, but I pre-filter mailing lists
> which muddies the water badly).
>
> I dunno. Do the testing gurus round here have any idea whether
> this type of hypothesis could be tested in practice?

What exactly is the hypothesis?  Whatever it is <wink>, it's certainly
testable, but testing w/ Outlook is at best clumsy (testing is easiest if
you have a stream of plain-text msgs ordered by time received; getting that
out of Outlook is a series of battles).
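If you can get such a time-ordered stream, a test of mistake-based training could be sketched roughly as below. Everything here is hypothetical scaffolding: `replay_mistake_based`, the `classify` and `train` callables, the cutoff values, and the toy keyword classifier are assumptions for illustration and are not part of spambayes or its test drivers.

```python
def replay_mistake_based(stream, classify, train,
                         ham_cutoff=0.2, spam_cutoff=0.9):
    """Replay messages in arrival order, training only on mistakes and unsures.

    stream: iterable of (tokens, is_spam) pairs ordered by time received.
    classify(tokens) -> score in [0, 1]; train(tokens, is_spam) updates the model.
    Returns a dict with counts of false positives, false negatives and unsures.
    """
    counts = {"fp": 0, "fn": 0, "unsure": 0}
    for tokens, is_spam in stream:
        score = classify(tokens)
        if score < ham_cutoff:
            wrong = is_spam
            if wrong:
                counts["fn"] += 1
        elif score > spam_cutoff:
            wrong = not is_spam
            if wrong:
                counts["fp"] += 1
        else:
            wrong = True  # treat an unsure as something to train on
            counts["unsure"] += 1
        if wrong:
            train(tokens, is_spam)
    return counts


# Toy keyword classifier, only to show the harness running.
spam_words = set()


def toy_classify(tokens):
    return 1.0 if any(t in spam_words for t in tokens) else 0.0


def toy_train(tokens, is_spam):
    if is_spam:
        spam_words.update(tokens)
    else:
        spam_words.difference_update(tokens)


stream = [
    (["viagra"], True),
    (["hello"], False),
    (["viagra"], True),
    (["hello", "viagra"], False),
]
result = replay_mistake_based(stream, toy_classify, toy_train)
print(result)
```

Running the same stream twice, once with the imbalance adjustment and once without, and comparing the FP, FN and unsure counts over time would be one way to turn the hypothesis into something measurable.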



