[spambayes-dev] A spectacular false positive
skip at pobox.com
Thu Nov 27 07:01:26 EST 2003
>> What I'd like to know is which message, if added to my training
>> database, would have the greatest effect on the scores of the other
>> unsure messages. That would help me decide which ones yield the most
Tim> If you can define what "greatest effect on the scores of the other
Tim> unsure messages" means, exactly, then it should be easy to automate
Tim> that decision (for each unsure: train on it, score all the other
Tim> unsures, compute "the effect" on their scores (whatever that means
Tim> to you), untrain it; then pick the one with the greatest
Tim> whatever-it-is you measured).
I mean "pushes the remaining unsures the furthest away from their current
scores". I guess I want to maximize:
    sum([abs(old-new) for (old,new) in zip(oldprobs, newprobs)])
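
For what it's worth, here is a rough sketch of how the loop Tim describes could be combined with that sum. It is only an illustration under stated assumptions: ToyClassifier is a made-up stand-in rather than the SpamBayes classifier, the names learn, unlearn, score, total_shift and most_influential are hypothetical, and each unsure is assumed to arrive with a ham/spam label that I have already decided on.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a trainable scorer (not the SpamBayes API)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def unlearn(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).subtract(set(tokens))

    def score(self, tokens):
        # Mean of smoothed per-token spam ratios; 0.5 means "no evidence".
        toks = sorted(set(tokens))
        if not toks:
            return 0.5
        ratios = [(self.spam[t] + 0.5) / (self.spam[t] + self.ham[t] + 1.0)
                  for t in toks]
        return sum(ratios) / len(ratios)


def total_shift(oldprobs, newprobs):
    # The quantity to maximize: summed absolute change in score.
    return sum(abs(old - new) for old, new in zip(oldprobs, newprobs))


def most_influential(clf, unsures):
    """unsures is a list of (tokens, is_spam) pairs.

    Returns (index, shift) for the candidate whose training moves the
    other unsures the furthest. The classifier is left as it was found.
    """
    best_index, best_shift = None, -1.0
    for i, (tokens, is_spam) in enumerate(unsures):
        others = [t for j, (t, _) in enumerate(unsures) if j != i]
        oldprobs = [clf.score(t) for t in others]
        clf.learn(tokens, is_spam)
        try:
            newprobs = [clf.score(t) for t in others]
        finally:
            clf.unlearn(tokens, is_spam)  # untrain before trying the next one
        shift = total_shift(oldprobs, newprobs)
        if shift > best_shift:
            best_index, best_shift = i, shift
    return best_index, best_shift
```

Each candidate costs one train, one untrain and a rescoring of every other unsure, so the whole search is quadratic in the number of unsures. That seems tolerable for the handful of unsures a person reviews by hand.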
Tim> Google on
Tim> "active learning" classification
Tim> to get a warm fuzzy feeling that this may be a fine thing to do
Thanks. When I get a chance, I may. On the other hand, I may just take
your word for it.
Tim> I train on "the worst" Unsure first (lowest-scoring spam or
Tim> highest-scoring ham), then rescore Unsures, and repeat until
Tim> they're all gone. A number of Unsures usually get resolved on
Tim> their own this way, especially near-duplicates of a new spam
I've been doing this sort of thing, though perhaps not consistently enough.
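
To keep myself honest about doing it consistently, the worst-first routine Tim describes could be sketched roughly like this. Again this is only an illustration: ToyClassifier is a made-up stand-in rather than the SpamBayes classifier, train_worst_first and the cutoff parameters are hypothetical names, the default cutoff values are assumptions, and folding "lowest-scoring spam or highest-scoring ham" into a single "wrongness" number is my own reading of what "worst" means.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a trainable scorer (not the SpamBayes API)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def score(self, tokens):
        # Mean of smoothed per-token spam ratios; 0.5 means "no evidence".
        toks = sorted(set(tokens))
        if not toks:
            return 0.5
        ratios = [(self.spam[t] + 0.5) / (self.spam[t] + self.ham[t] + 1.0)
                  for t in toks]
        return sum(ratios) / len(ratios)


def train_worst_first(clf, unsures, ham_cutoff=0.2, spam_cutoff=0.9):
    """unsures is a list of (tokens, is_spam) pairs.

    Repeatedly rescores, drops anything that has left the unsure band,
    and trains on the worst remaining message. Returns the messages
    that actually had to be trained on.
    """

    def wrongness(item):
        tokens, is_spam = item
        score = clf.score(tokens)
        # A spam is worse the lower it scores, a ham the higher it scores.
        return (1.0 - score) if is_spam else score

    pending = list(unsures)
    trained = []
    while True:
        # Rescore and keep only what is still strictly between the cutoffs.
        pending = [item for item in pending
                   if ham_cutoff < clf.score(item[0]) < spam_cutoff]
        if not pending:
            return trained
        worst = max(pending, key=wrongness)
        clf.learn(*worst)
        trained.append(worst)
        pending.remove(worst)
```

Note that a message leaving the unsure band on the wrong side (a spam that drops below the ham cutoff, say) is also treated as resolved here; catching those would need a separate check.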
Tim> I don't spend any time any more trying to guess whether a message
Tim> "really is" ham or spam -- if it's not obvious after 5 seconds, I
Tim> toss it without training on it at all.