[Spambayes] How low can you go?

Seth Goodman nobody at spamcop.net
Thu Dec 11 16:45:31 EST 2003


[Skip Montanaro]
> Nothing magic or random.  I primed the pump with one ham and one spam.
> Then sorted the unsures which arrived by score.  Train the lowest scoring
> spam as spam.  Now rescore the unsure mailbox, only considering messages
> which are now scored as spam.  Delete them.  Lather.  Rinse.  Repeat.  You
> will obviously have many hams which initially score as unsure as well.  Do
> the same thing for them, just start from the highest scoring ham.

I just re-read this and realized I missed something key in your description.
Your training set is culled only from unsures, rather than the set of all
messages.  My adaptation of your algorithm for Outlook on the Wiki is wrong,
and I'll fix it.  The more important thing is that your method is really
"train on unsures", which is fundamentally different from mistake-based
training and train on everything.  The particular incremental method of
selecting a minimal subset that makes a good classifier can be applied to
any original corpus.  The corpus that we select the training set from
defines the training tactic.
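
For concreteness, here is roughly how I read your loop in Python.  The
classifier object with score() and train() methods, the Message class, and
the cutoff values are stand-ins I made up for the sketch, not the actual
Spambayes API:

from dataclasses import dataclass

@dataclass
class Message:
    text: str
    label: str      # 'ham' or 'spam', as judged by the human sorter

# Assumed cutoffs; a real setup has its own tuned values.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90

def is_unsure(clf, msg):
    return HAM_CUTOFF <= clf.score(msg.text) < SPAM_CUTOFF

def train_on_unsures(clf, unsures):
    pool = list(unsures)
    while pool:
        # Rescore: drop anything the classifier is no longer unsure about.
        pool = [m for m in pool if is_unsure(clf, m)]
        if not pool:
            break
        # Train on the hardest remaining example: the lowest-scoring
        # spam if any are left, otherwise the highest-scoring ham.
        spams = [m for m in pool if m.label == 'spam']
        if spams:
            pick = min(spams, key=lambda m: clf.score(m.text))
        else:
            pick = max(pool, key=lambda m: clf.score(m.text))
        clf.train(pick.text, pick.label == 'spam')
        pool.remove(pick)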

The real question is then: what corpus should you select the training set
from, i.e. what is the best training tactic?  The choices identified so far,
sketched as predicates below the list, are:

- train on errors
- train on unsures
- train on errors + unsures
- train on errors + unsures + non-obvious correct decisions
- train on everything
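
Expressed as selection predicates over one scored message (continuing the
hypothetical sketch above; the tactic names and the 'obvious' flag are
mine, not anything Spambayes defines):

def predicted(score):
    # Three-way decision from a score, using the cutoffs above.
    if score >= SPAM_CUTOFF:
        return 'spam'
    if score < HAM_CUTOFF:
        return 'ham'
    return 'unsure'

def in_corpus(tactic, msg, score, obvious=False):
    """Does this classified message belong to the corpus the
    training set is drawn from, under the given tactic?"""
    decision = predicted(score)
    unsure = decision == 'unsure'
    error = not unsure and decision != msg.label
    correct = decision == msg.label
    if tactic == 'errors':
        return error
    if tactic == 'unsures':
        return unsure
    if tactic == 'errors+unsures':
        return error or unsure
    if tactic == 'errors+unsures+nonobvious':
        # 'obvious' needs its own definition, e.g. a score very near
        # 0.0 or 1.0; it's left as a caller-supplied flag here.
        return error or unsure or (correct and not obvious)
    return True     # 'everything'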

Train on errors defines mistake-based training, with its well-debated
properties (see the "Watch out for digests ..." thread).  Using the unsures
as the original corpus makes it very different from mistake-based training
because it doesn't include *any* mistakes; it consists entirely of messages
the classifier scored as "I can't decide".  It's also different from train
on everything because it doesn't include *any* messages that were classified
correctly.  I don't know what its properties would be, other than that it
appears to iteratively maximize the bimodal nature of the message score
distribution by reducing unsures.  I suggest that because you are training
on unsures and picking your training set precisely so that the corpus of
unsures becomes strongly bimodal, with few or no unsures left.
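
That claim could be checked directly by watching the score histogram after
each training pass (again reusing the hypothetical clf from above):

def score_histogram(clf, messages, bins=10):
    # Coarse histogram of scores; a bimodal distribution piles up
    # in the first and last bins, with little in between.
    counts = [0] * bins
    for m in messages:
        counts[min(int(clf.score(m.text) * bins), bins - 1)] += 1
    return counts

If the tactic behaves as I suggest, repeated passes should push the middle
bins toward zero.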

This method, if you retrain often enough, will probably result in a small
number of unsures, perhaps the smallest of all the methods.  How it will
perform on false positives and false negatives is a separate question, since
those are not included in the message corpus the training set is selected
from.  I don't have an opinion one way or another; I'm just now recognizing
how different a training tactic this is.
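
Measuring the false positive and false negative rates would mean scoring a
held-out set of mail that the selection process never touched; roughly
(same hypothetical interface as above):

def fp_fn_rates(clf, held_out):
    # Error rates on mail the selection process never saw, since the
    # training corpus here contains no correctly-classified messages.
    fp = sum(1 for m in held_out
             if m.label == 'ham' and predicted(clf.score(m.text)) == 'spam')
    fn = sum(1 for m in held_out
             if m.label == 'spam' and predicted(clf.score(m.text)) == 'ham')
    n = len(held_out) or 1
    return fp / n, fn / n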

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above



