[spambayes-dev] A spectacular false positive

Tue Nov 25 22:03:25 EST 2003

[Skip Montanaro]
> ...
> Here's something I think would be interesting.  At the moment I have
> about 40 unsures awaiting a decision from me (train or discard).  I'm
> consciously trying to be conservative.  What I'd like to know is which
> message, if added to my training database, would have the greatest
> effect on the scores of the other unsure messages.  That would help
> me decide which ones yield the most benefit.

If you can define what "greatest effect on the scores of the other unsure
messages" means, exactly, then it should be easy to automate that decision
(for each unsure:  train on it, score all the other unsures, compute "the
effect" on their scores (whatever that means to you), untrain it; then pick
the one with the greatest whatever-it-is you measured).
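
A minimal Python sketch of that loop, under two assumptions: each unsure
already carries a tentative ham/spam label, and "the effect" is taken to be
the total absolute change in the other unsures' scores.  ToyClassifier, its
train/untrain/score methods, effect() and most_influential() are hypothetical
stand-ins made up for illustration; they are not the SpamBayes API.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a real classifier (NOT the SpamBayes API)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def untrain(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).subtract(set(tokens))

    def score(self, tokens):
        # Mean of lightly smoothed per-token spam ratios; a token never
        # seen in training contributes 0.5 ("no evidence").
        toks = set(tokens)
        if not toks:
            return 0.5
        ratios = [(self.spam[t] + 0.05) / (self.spam[t] + self.ham[t] + 0.1)
                  for t in toks]
        return sum(ratios) / len(ratios)


def effect(before, after):
    # One possible definition of "the effect": total absolute score movement.
    return sum(abs(a - b) for a, b in zip(after, before))


def most_influential(clf, unsures):
    """unsures is a list of (tokens, is_spam) pairs.

    Returns (index, effect) for the message whose training moves the
    scores of the other unsures the most.  The classifier is left as it
    was found, because every trial training is undone again.
    """
    best_index, best_effect = None, -1.0
    for i, (tokens, is_spam) in enumerate(unsures):
        others = [t for j, (t, _) in enumerate(unsures) if j != i]
        before = [clf.score(t) for t in others]
        clf.train(tokens, is_spam)      # train on it
        after = [clf.score(t) for t in others]
        clf.untrain(tokens, is_spam)    # untrain it
        moved = effect(before, after)
        if moved > best_effect:
            best_index, best_effect = i, moved
    return best_index, best_effect
```

Swapping in a different effect() (for example, the number of other unsures
pushed past a cutoff) changes what "greatest" means without touching the loop.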

Google on

    "active learning" classification

to get a warm fuzzy feeling that this may be a fine thing to do <wink>.

I train on "the worst" Unsure first (lowest-scoring spam or highest-scoring
ham), then rescore Unsures, and repeat until they're all gone.  A number of
Unsures usually get resolved on their own this way, especially
near-duplicates of a new spam.  I no longer spend any time trying to
guess whether a message "really is" ham or spam -- if it's not obvious after
5 seconds, I toss it without training on it at all.
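
The worst-first routine can be sketched the same way.  This is only an
illustration under stated assumptions: the cutoffs 0.20 and 0.90 are
illustrative values, a label of None stands for "not obvious after 5 seconds",
and ToyClassifier, is_unsure() and worst_first() are hypothetical names, not
the SpamBayes API.

```python
from collections import Counter

HAM_CUTOFF, SPAM_CUTOFF = 0.20, 0.90   # illustrative thresholds (assumption)


class ToyClassifier:
    """Hypothetical stand-in for a real classifier (NOT the SpamBayes API)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Mean of lightly smoothed per-token spam ratios; unseen tokens give 0.5.
        toks = set(tokens)
        if not toks:
            return 0.5
        ratios = [(self.spam[t] + 0.05) / (self.spam[t] + self.ham[t] + 0.1)
                  for t in toks]
        return sum(ratios) / len(ratios)


def is_unsure(score):
    return HAM_CUTOFF < score < SPAM_CUTOFF


def worst_first(clf, unsures):
    """unsures is a list of (tokens, is_spam); is_spam is None if not obvious.

    Returns the messages actually trained on, in training order.
    """
    # Messages that are not obviously ham or spam are tossed untrained.
    pending = [(t, s) for t, s in unsures if s is not None]
    trained = []
    while True:
        # Rescore, and drop anything that has resolved on its own.
        pending = [(t, s) for t, s in pending if is_unsure(clf.score(t))]
        if not pending:
            return trained

        def wrongness(item):
            tokens, is_spam = item
            score = clf.score(tokens)
            # Low-scoring spam and high-scoring ham are both "bad".
            return 1.0 - score if is_spam else score

        worst = max(pending, key=wrongness)
        clf.train(*worst)
        trained.append(worst)
        pending.remove(worst)
```

With a new spam and a near-duplicate of it among the unsures, training on the
first one is typically enough to push the duplicate past the spam cutoff, so it
drops out of the pending list without being trained on.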