[Spambayes] lots of unsures, heavily biased towards spam

David Abrahams dave at boost-consulting.com
Sun Feb 4 18:13:47 CET 2007


"Seth Goodman" <sethg at goodmanassociates.com> writes:

> David Abrahams wrote on Saturday, February 03, 2007 9:01 PM -0600:
>
>> "Seth Goodman" <sethg at goodmanassociates.com> writes:
>>
>> > If your training set has much more spam than ham, you can train on
>> > ham that already scores properly.
>>
>> That'll help?  Great; it's easy enough.
>
> There is anecdotal evidence that this helps, as well as a few systems
> where it doesn't seem to matter.  If Spambayes is not classifying well
> enough, this is a good thing to try.

Done, last night.

>> > Whether you choose ham that scores very low already (typical ham) or
>> > the highest scoring ham (unusual ham) is your preference.
>>
>> Are you suggesting that it makes no difference?
>
> Not at all ... only that no one can tell you for sure which is better
> for your own mail flow.  

:)

> My preference for adding ham to a training set is to pick the
> highest scoring ham 

You mean literally the ones with scores closest to 1.0?

> and train on a few at a time, rescoring the ham folder after
> training each new group.

Sorry, lots of questions:

  - What does "rescoring the ham folder" mean?

  - When you say "ham folder" are you referring to a folder full of
    ham used for training?

  - If so, what difference would it make to allow Spambayes to adjust
    the scores on those messages?

  - When you "pick the highest scoring ham" are you picking from your
    general mail history or are you picking from the ham folder and
    training those mails again?
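
For what it's worth, here's my naive picture of the workflow, as a
sketch (pure guesswork on my part; score() below stands in for
whatever Spambayes call actually produces a message's spam
probability, and the mbox path is made up):

    import mailbox

    def highest_scoring_ham(path, score, n=5):
        # Re-run the classifier over saved ham ("rescoring"?) and
        # return the n messages it is now least sure about.
        scored = [(score(msg), msg["Subject"])
                  for msg in mailbox.mbox(path)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:n]   # scores nearest 1.0: candidates to train on

    # e.g. highest_scoring_ham("Mail/train-ham", score)

Is that roughly it?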

I think a glossary or terminology section would be a nice addition to
the spambayes site :)

> It's deliberately indefinite, as results are variable.  I can tell you
> that my setup has been operating at around 5% unsures, 0.5% false
> negatives (spam in the inbox) and perhaps one false positive (ham in the
> spam folder) per year for a long time.  This seems to be typical, though
> 0.1% false positives might be more common.  My current training set has
> around 250 ham and 500 spam.  What kind of performance do you see?

Well, I haven't been measuring carefully, unfortunately.  I just have
a feeling that I could do better.  After balancing ham and spam last
night I woke up to 75 messages in my SPAMBOX, all correctly identified
as spam, 20 messages (all spam) in my UNSUREBOX, and 7 new messages
in my INBOX, two of which were spam.  I have various server-side rules
that file some new messages in other mailboxes, but from a casual
look it appears that none of those were spam or unsure.  Just as for
you, Spambayes has for years been very good about not classifying ham
as spam.  However, it used to be that spam very rarely crept into my
INBOX, whereas recently I have been getting 2-3 false negatives every
night.
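
Back of the envelope, last night that's 20/102 unsures (roughly 20%)
and 2/102 false negatives (roughly 2%), so quite a bit worse than the
5% and 0.5% you're seeing.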

>> > I try to avoid mine going further than 2:1 and train on
>> > my highest scoring ham to fix it.  This seems to work better for me
>> > than training only on unsures.
>>
>> I don't get nearly enough unsures that are ham to correct the
>> imbalance that way.
>
> The strategy you imply is to train on all unsures

I don't quite do that; my spam training folder would be out of control
if I did.  But I am fairly indiscriminate about dumping unsure
messages into my spam training folder.  If I ever get ham classified
as unsure, I make sure to train on that.

> which happens to be the method the Outlook plugin is based on.  This
> is because it is easy to understand and generally works well.  One
> problem is that, over time, training on unsures tends to result in a
> training set that has a lot more spam than ham,

Right.

> and this sometimes causes the classifier to function poorly (more
> weasel words).

Another terminology gap.

> If that is your problem, you need to train on additional ham that
> already classifies correctly.  The only way you can tell if that's
> your problem is to train on more ham and see if that helps.

OK;  hard to tell yet.

>> > Please let us know what you try, what helps and what doesn't.
>>
>> I will, but aren't you afraid there are just too many levers to pull,
>> what with all the configuration options and legit approaches to
>> training?  Seems like it would be hard to learn much from user
>> feedback.
>
> There are quite a few variables, and I appreciate your willingness to
> report back.  

You're welcome.

> The developers do read this list, and your results will be
> noted.  As far as what is learned from whom, there has been a lot of
> careful testing by a lot of people using a purpose-built testing system,
> but it's good to continue to do reality checks.  If what you report
> reinforces the current view, that's good news.  If there are persistent
> reports that disagree, then there is something to look at.  So
> yes, end user feedback is very helpful.

Great.  Another problem is that I don't have a rigorous way to measure
performance.  Any ideas?

Dreaming of a tool that can record my configuration changes and
training history, and learn about misclassifications from the mails
I throw into the ham and spam training folders (by looking at the
X-Spambayes-Classification header), so we can get a clearer picture of
what works...
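
Even something crude might be a start.  For instance (a rough sketch,
stdlib only; the folder paths are made up and I'm assuming the
training folders are mbox files):

    import mailbox
    from collections import Counter

    # Hypothetical paths -- substitute your own training folders.
    FOLDERS = {"ham": "Mail/train-ham", "spam": "Mail/train-spam"}

    for label, path in FOLDERS.items():
        verdicts = Counter()
        for msg in mailbox.mbox(path):
            # The header Spambayes added when it filtered the message;
            # keep just the verdict in case extra info follows a ';'.
            v = msg.get("X-Spambayes-Classification") or "missing"
            verdicts[v.split(";")[0].strip()] += 1
        # Any "spam" verdict in the ham folder (or "ham" in the spam
        # folder) was a misclassification I corrected by training.
        print(label, dict(verdicts))

If I ran that periodically and logged the tallies alongside my option
changes, it would at least show how the unsure and miss rates move
over time.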

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com

