[Spambayes] Training on unusual ham - revisited

Seth Goodman sethg at GoodmanAssociates.com
Mon Feb 13 00:13:30 CET 2006


On  -0600, Tony Meyer wrote:

> I'm still mostly of the opinion that using some sort of 'train to
> exhaustion' regime would work best.  This would allow both expiry and
> balancing (it essentially does pruning), and still deliver excellent
> results.

I agree that train-to-exhaustion is very appealing.  How does it
accomplish expiry?


> However, it would mean keeping cached mail around for a
> while, at least.

Well, not cached mail, but the list of tokens that were trained from
that message.  In the case of train-to-exhaustion, you'd also need a
training count to tell you how many times you trained on the message.
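
As a rough sketch of that bookkeeping (not actual SpamBayes code; the names
TrainingRecord and TokenStore and the plain token counters are all
hypothetical), keeping the token list plus a training count per message is
enough to back a message out of the database later:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TrainingRecord:
    tokens: frozenset   # tokens that were trained from the message
    is_spam: bool       # class the message was trained as
    count: int = 0      # how many times the message was trained


class TokenStore:
    """Toy token-count database, assumed here only for illustration."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.records = {}   # message id -> TrainingRecord

    def train(self, msg_id, tokens, is_spam):
        # The first training of a message fixes its token list and class;
        # later calls for the same id only bump the counts again.
        rec = self.records.setdefault(
            msg_id, TrainingRecord(frozenset(tokens), is_spam))
        counts = self.spam_counts if rec.is_spam else self.ham_counts
        counts.update(rec.tokens)
        rec.count += 1

    def expire(self, msg_id):
        # Undo every training pass for this message without needing
        # the original mail, only the stored tokens and count.
        rec = self.records.pop(msg_id)
        counts = self.spam_counts if rec.is_spam else self.ham_counts
        for tok in rec.tokens:
            counts[tok] -= rec.count
            if counts[tok] <= 0:
                del counts[tok]
```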

<...>

> Training should be done on all unsure messages, too.  When I was
> using the Outlook plug-in, I commonly had ham end up as (low scoring)
> unsure.  That should reduce the imbalance somewhat.  Theoretically,
> once SpamBayes starts making mistakes, the number of ham-as-unsure
> would increase, thus helping the balance.

I use thresholds of 0.05 and 0.80, and the result is that virtually
every message in unsure is ham.  It is a convenience to not have the
same number of ham as spam classify as unsure.  So unless you're willing
to leave the ham threshold very low and tolerate ham showing up in the
unsure folder pretty regularly, the database will tend to become
unbalanced over time, in addition to growing faster than it otherwise
needs to.



> Something that I think would help is not training every false
> negative/spam-as-unsure.  Something along the lines of training one,
> then rescoring the others to see if they need training.  However, the
> plug-in does not make this a simple task, at least at the moment.

Yes, this is an alternative to just deleting unsure spam.  Here's a
scheme that would automate this and encourage users to avoid
overtraining.

- Create two new folders under the unsure folder called "reclassified as
ham" and "reclassified as spam".

- Upon a training event, rescore the messages in the ham, spam and
unsure folders.  If messages change classification, do as follows:  move
unsures into the unsure folder, move newly classified ham into
"reclassified as ham" and newly classified spam into "reclassified as
spam".

- Have an additional button for "accept training" that moves messages
from "reclassified as ham" into ham and messages from "reclassified as
spam" into spam without doing incremental training.  After the operation
is complete, the "accept training" button and the empty "reclassified
as ..." folders would disappear.  The reason to delete the empty folders
is that upon training a new message, seeing one or both of the
"reclassified as ..." folders appear would draw the user's attention to
any reclassifications, which are probably mistakes that need to be
corrected.
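
The steps above can be sketched roughly as follows.  This is only an
illustration under stated assumptions, not plug-in code: folders are
modeled as a dict of lists, the function names and score_of callback are
hypothetical, the cutoffs are the 0.05 and 0.80 I mentioned earlier, and
which side of each cutoff is inclusive is my guess.

```python
HAM_CUTOFF = 0.05    # assumed thresholds, taken from this message
SPAM_CUTOFF = 0.80


def classify(score):
    # Assumption: below the ham cutoff is ham, at or above the spam
    # cutoff is spam, everything in between is unsure.
    if score < HAM_CUTOFF:
        return "ham"
    if score >= SPAM_CUTOFF:
        return "spam"
    return "unsure"


def destination(current_folder, score):
    """Folder a message should be in after rescoring."""
    new_class = classify(score)
    if new_class == current_folder or new_class == "unsure":
        return new_class
    return "reclassified as " + new_class


def rescore(folders, score_of):
    """Rescore ham, spam and unsure; move messages that changed class.

    folders maps a folder name to a list of message ids; score_of maps
    a message id to its current score.
    """
    moves = []
    for name in ("ham", "spam", "unsure"):
        for msg in folders.get(name, []):
            dest = destination(name, score_of(msg))
            if dest != name:
                moves.append((msg, name, dest))
    # Apply the moves afterwards so no message is scored twice.
    for msg, src, dest in moves:
        folders[src].remove(msg)
        folders.setdefault(dest, []).append(msg)


def accept_training(folders):
    """Move reclassified messages into ham/spam without training,
    then drop the now-empty "reclassified as ..." folders."""
    for kind in ("ham", "spam"):
        pending = folders.pop("reclassified as " + kind, [])
        folders.setdefault(kind, []).extend(pending)
```

Because the "reclassified as ..." folders are created only when a message
lands in them and are removed by accept_training, their mere presence
signals that something changed classification.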

Here are some pros and cons.


pro:

1) Makes results of training a single message immediately obvious.

2) Removes unsures that now classify as ham or spam from the unsure
folder.

3) Avoids leaving newly created false positives and false negatives in
the ham and spam folders, where they are easy to miss.

4) Makes it more obvious when a user trains a message into the wrong
classification, as several other messages will immediately move to the
unsure or "reclassified as ..." folders.

5) Does not require the user to display spam scores and make decisions
based on them.

6) Encourages the user to train on the smallest number of messages
necessary to create correct classifications.

7) Compatible with train-to-exhaustion.  If a message is trained as ham
or spam but still doesn't classify correctly, it automatically goes back
to the unsure folder.
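
To make that last point concrete, here is a toy sketch of a
train-to-exhaustion loop as I understand the idea: keep retraining the
messages that still misclassify until none do or a round limit is hit.
The ToyClassifier and its scoring are a deliberately crude stand-in
(not the SpamBayes classifier), and max_rounds and the cutoffs are
assumptions.

```python
from collections import Counter


class ToyClassifier:
    """Crude stand-in: score is the spam share of matching token counts."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        s = sum(self.spam[t] for t in set(tokens))
        h = sum(self.ham[t] for t in set(tokens))
        return 0.5 if s + h == 0 else s / (s + h)


def train_to_exhaustion(clf, corpus, ham_cutoff=0.05, spam_cutoff=0.80,
                        max_rounds=10):
    """corpus maps message id -> (tokens, is_spam).

    Returns the per-message training counts, which is the number that
    would have to be stored alongside each message's token list.
    """
    counts = Counter()
    for _ in range(max_rounds):
        trained_any = False
        for msg_id, (tokens, is_spam) in corpus.items():
            score = clf.score(tokens)
            correct = score >= spam_cutoff if is_spam else score < ham_cutoff
            if not correct:
                clf.train(tokens, is_spam)
                counts[msg_id] += 1
                trained_any = True
        if not trained_any:
            break   # every message now classifies correctly
    return counts
```

A message that never reaches its cutoff within max_rounds simply keeps
scoring as unsure, which is the case where it would go back to the unsure
folder.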


con:

1) Requires dynamically creating and deleting two other folders under
unsure.

2) Requires a third button for the unsure folder that is context
sensitive.

3) Will generate user questions.

--
Seth Goodman


