[spambayes-dev] lowest scoring message isn't always "best" one to train on

Mon Jan 19 14:19:03 EST 2004

Seth Goodman wrote:
> [Skip Montanaro]
>> Note that the first item has a very low spamprob itself, but of the
>> bunch I displayed, the best ones to train on to push the most other
>> spams into spam range all score around 0.8 to 0.9.  ...
> 
> ...  I've noticed that I also get the most shifting of
> untrained spam classifications from unsure to spam on the later
> messages I train on, that is, the ones with higher scores.  My
> recollection is that things start to move much better when the spam I
> add to the training set is around 75% or higher.  The low-scoring
> unsures do move a few other low-scoring unsures up in score, but I
> seem to get considerably more "action" out of the higher-scoring
> ones.

Speaking theoretically with no evidence to back it up:

It seems to me that this is an expected outcome.  If you train on a
single message, you've added only 1 to the spam count of each token in it.
How much that raises the score of other messages depends both on the
size of your current training set and on how similar other messages are
to the one you trained.  Messages that are similar to the message you
choose to train on are probably also going to have similar initial
scores.  Pushing a message that is already close to the threshold into
the spam region doesn't take much of an increase, but pushing a very
low-scoring message over the threshold is much more difficult and a
single message probably won't be enough to do it in many cases.
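
To put rough numbers on the "added only 1" point, here is a minimal
sketch.  It assumes a simplified Robinson-style token probability with a
prior strength s and prior value x; the function names and the default
values are my own illustration, not the actual SpamBayes code.  It
compares how far one extra spam message moves a token's probability in a
small training set versus a large one.

```python
def token_spamprob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Simplified token spam probability (assumed form, for illustration only).

    spam_count / ham_count: number of trained spams / hams containing the token.
    nspam / nham: total trained spams / hams (both assumed to be > 0).
    s, x: strength and value of the prior for rarely seen tokens (assumed).
    """
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    total = spam_ratio + ham_ratio
    p = spam_ratio / total if total > 0 else x
    n = spam_count + ham_count
    # Blend the raw estimate with the prior, weighted by the evidence n.
    return (s * x + n * p) / (s + n)


def shift_from_one_more_spam(spam_count, ham_count, nspam, nham):
    """Change in a token's probability after training one more spam containing it."""
    before = token_spamprob(spam_count, ham_count, nspam, nham)
    after = token_spamprob(spam_count + 1, ham_count, nspam + 1, nham)
    return after - before


# Same token proportions, very different training set sizes.
small = shift_from_one_more_spam(2, 2, 10, 10)
large = shift_from_one_more_spam(200, 200, 1000, 1000)
print("shift with 10+10 trained messages:     %.4f" % small)
print("shift with 1000+1000 trained messages: %.4f" % large)
```

Under these assumptions the per-token shift shrinks as the training set
grows, so a single trained message can only nudge its neighbours, and
the ones already near the threshold are the ones that cross it.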

Just because training a certain message pushes the most other messages
into the spam region doesn't necessarily mean it represents the greatest
improvement in the classifier.  Chances are good that if I have a large
number of unsures then I'm not going to stop training after only one
message.  If I'm going to train on N messages during the same training
session, the order in which I train them isn't important, since training
only accumulates token counts.  The key is to choose the smallest
possible training *set* such that *all* the other unsure messages will
be identified properly.  Maybe a closer
approximation to this would be to look for the message that causes the
greatest increase in the mean spam score of the remaining messages.
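
A rough sketch of that selection rule, again with no evidence behind it:
ToyClassifier below is a hypothetical stand-in that scores a message by
averaging simplified token probabilities, which is not how SpamBayes
combines them, and pick_by_mean_gain is a name I made up.  For each
unsure message it trains a copy of the classifier on that message as
spam and measures how much the mean score of the remaining unsures
rises.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical classifier: averages per-token probabilities (illustration only)."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.nspam = self.nham = 0

    def copy(self):
        c = ToyClassifier()
        c.spam, c.ham = Counter(self.spam), Counter(self.ham)
        c.nspam, c.nham = self.nspam, self.nham
        return c

    def train_spam(self, tokens):
        self.nspam += 1
        self.spam.update(set(tokens))  # count each token once per message

    def train_ham(self, tokens):
        self.nham += 1
        self.ham.update(set(tokens))

    def token_prob(self, tok, s=0.45, x=0.5):
        sc, hc = self.spam[tok], self.ham[tok]
        n = sc + hc
        if n == 0:
            return x  # unseen token: fall back to the prior
        sr = sc / max(self.nspam, 1)
        hr = hc / max(self.nham, 1)
        return (s * x + n * (sr / (sr + hr))) / (s + n)

    def score(self, tokens):
        toks = set(tokens)
        if not toks:
            return 0.5
        return sum(self.token_prob(t) for t in toks) / len(toks)


def mean_score(clf, messages):
    return sum(clf.score(m) for m in messages) / len(messages)


def pick_by_mean_gain(clf, unsures):
    """Return (index, gain) of the unsure whose training most raises the others' mean score."""
    best_index, best_gain = None, float("-inf")
    for i, candidate in enumerate(unsures):
        rest = unsures[:i] + unsures[i + 1:]
        if not rest:
            continue
        trial = clf.copy()
        trial.train_spam(candidate)
        gain = mean_score(trial, rest) - mean_score(clf, rest)
        if gain > best_gain:
            best_index, best_gain = i, gain
    return best_index, best_gain


clf = ToyClassifier()
clf.train_ham(["meeting", "report"])
clf.train_spam(["viagra", "cheap"])
unsures = [
    ["pills", "offer", "cheap"],
    ["pills", "offer", "now"],
    ["pills", "offer", "today"],
    ["lottery", "winner", "claim"],
]
best, gain = pick_by_mean_gain(clf, unsures)
print("train on unsure #%d, mean gain %.3f" % (best, gain))
```

This only picks one message at a time, so it is a greedy approximation
to the smallest-training-set goal rather than a solution to it, and it
rescores every remaining unsure for every candidate.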

-- 
Kenny Pitt