[Spambayes] lots of unsures, heavily biased towards spam

Seth Goodman sethg at goodmanassociates.com
Sun Feb 4 06:10:57 CET 2007


David Abrahams wrote on Saturday, February 03, 2007 9:01 PM -0600:

> "Seth Goodman" <sethg at goodmanassociates.com> writes:
>
> > If your training set has much more spam than ham, you can train on
> > ham that already scores properly.
>
> That'll help?  Great; it's easy enough.

There is anecdotal evidence that this helps, as well as a few reports
of systems where it doesn't seem to matter.  If SpamBayes is not
classifying well enough, this is a good thing to try.


>
> > Whether you choose ham that scores very low already (typical ham) or
> > the highest scoring ham (unusual ham) is your preference.
>
> Are you suggesting that it makes no difference?

Not at all ... only that no one can tell you for sure which is better
for your own mail flow.  My preference for adding ham to a training set
is to pick the highest scoring ham and train on a few at a time,
rescoring the ham folder after training each new group.  There are a
lot of different approaches, and none has emerged as a clear winner
across everyone's mail flow.
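
In rough Python terms, the loop looks something like this (an untested
sketch; the classifier object and its score() and train_ham() methods
are hypothetical stand-ins for whatever interface your setup exposes,
not the actual SpamBayes API):

    # Train on the highest-scoring ham a few messages at a time,
    # rescoring the whole ham folder after each small batch.
    BATCH_SIZE = 5      # messages trained per pass
    HAM_CUTOFF = 0.20   # SpamBayes' default ham cutoff

    def train_highest_scoring_ham(classifier, ham_msgs):
        while ham_msgs:
            # Rescore everything; most spam-like ham first.
            ham_msgs.sort(key=classifier.score, reverse=True)
            if classifier.score(ham_msgs[0]) < HAM_CUTOFF:
                break                       # everything scores as ham
            for msg in ham_msgs[:BATCH_SIZE]:
                classifier.train_ham(msg)   # tell it this is ham
            # Loop back, which rescores before the next batch.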


>
> > If you use the Outlook plugin,
>
> No offense to all the Outlook users out there, but I avoid it like the
> plague.  I'm using sb_imapfilter and doing the filtering server-side.

No offense taken.  This is a public mailing list for a spam filtering
program with a specific version for Outlook.  How to use it with Outlook
is of interest to a lot of readers.


>
> > just move the ham you want to train on to the unsure folder and tell
> > Spambayes it's not spam.  How much trained ham/spam imbalance is too
> > much is also up for debate.  Some people have reported good results
> > with 5:1 and even 10:1 imbalance, while others do poorly under those
> > conditions.
>
> Sounds pretty indefinite.  What's poorly mean?

It's deliberately indefinite, as results are variable.  I can tell you
that my setup has been operating at around 5% unsures, 0.5% false
negatives (spam in the inbox) and perhaps one false positive (ham in the
spam folder) per year for a long time.  This seems to be typical, though
0.1% false positives might be more common.  My current training set has
around 250 ham and 500 spam.  What kind of performance do you see?
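
For concreteness, those rates are just simple counts over a stretch of
mail.  With made-up numbers (hypothetical, only to show the
arithmetic):

    # Hypothetical counts over one period of mail, to show how the
    # percentages above are computed.
    total     = 20000   # messages classified
    unsure    = 1000    # scored between the ham and spam cutoffs
    false_neg = 100     # spam delivered to the inbox
    false_pos = 1       # ham delivered to the spam folder

    print("unsures:         %.1f%%" % (100.0 * unsure / total))    # 5.0
    print("false negatives: %.1f%%" % (100.0 * false_neg / total)) # 0.5
    print("false positives: %.3f%%" % (100.0 * false_pos / total)) # 0.005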


>
> > I try to avoid mine going further than 2:1 and train on
> > my highest scoring ham to fix it.  This seems to work better for me
> > than training only on unsures.
>
> I don't get nearly enough unsures that are ham to correct the
> imbalance that way.

The strategy you imply is to train on all unsures, which happens to be
the method the Outlook plugin is built around, presumably because it
is easy to understand and generally works well.  One problem is that,
over time, training on unsures tends to produce a training set with a
lot more spam than ham, and that sometimes causes the classifier to
perform poorly (more weasel words).  If that is your situation, you
need to train on additional ham that already classifies correctly.
The only way to tell whether that is your problem is to train on more
ham and see if it helps.
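
One way to watch for this is to compare the trained counts and top up
with already-correct ham whenever spam pulls too far ahead.  Again a
sketch, with a hypothetical classifier interface (the nham/nspam
counts and the score() and train_ham() calls stand in for whatever
your setup exposes):

    # Keep the trained spam:ham ratio from drifting much past 2:1 by
    # training extra ham that already scores correctly.
    MAX_RATIO = 2.0     # tolerate up to twice as much spam as ham

    def rebalance(classifier, ham_msgs):
        while ham_msgs and classifier.nspam > MAX_RATIO * classifier.nham:
            # Train the highest-scoring (most borderline) ham first.
            msg = max(ham_msgs, key=classifier.score)
            classifier.train_ham(msg)
            ham_msgs.remove(msg)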


> > Please let us know what you try, what helps and what doesn't.
>
> I will, but aren't you afraid there are just too many levers to pull,
> what with all the configuration options and legit approaches to
> training?  Seems like it would be hard to learn much from user
> feedback.

There are quite a few variables, and I appreciate your willingness to
report back.  The developers do read this list, and your results will be
noted.  As for what is learned from whom: there has been a lot of
careful testing by a lot of people using a purpose-built testing
system, but it's good to keep doing reality checks.  If what you
report reinforces the current view, that's good news.  If there are
persistent reports that disagree with it, then there is something to
look at.  So
yes, end user feedback is very helpful.

--
Seth Goodman


