[Spambayes] Proposing to remove 4 combining schemes

T. Alexander Popiel popiel@wolfskeep.com
Thu Oct 17 23:38:23 2002


In message:  <3DAF2DE1.5090404@hooft.net>
             Rob Hooft <rob@hooft.net> writes:
>Tim Peters wrote:
>> [Sean True]
>
>>>If I'm not sure it's spam, I'd prefer a score that matched that.
>> 
>> 
>> Under chi-combining, a score under .95 (as a rule of thumb so far) does mean
>> "I'm not sure it's spam".  So quantifying this would be helpful.
>
>My gut feeling says: under ideal combining, a score under .95 means "I'm 
>less than 95% sure this is spam".

Ah, here's the basic problem... the final score we're generating has
very little to do with a percentage, or any human concept of assurance.
Heck, the final number isn't even a percentage of how much the message
looks like ham or spam, since we're combining _those_ two numbers in
very non-percentage-like ways.

On the other hand, end users are quite likely to put inappropriate
interpretations like this on the numbers, if they see them... so in
any final presentation of this system, I'd _STRONGLY_ discourage
showing the numbers.  Just the three categories 'spam', 'ham', and
'unknown' should be sufficient.
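
A minimal sketch of what I mean by hiding the number behind three
categories.  The function name and both cutoff values are assumptions
made up for illustration, not settings taken from the project; real
cutoffs would have to come from test runs.

```python
def label(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Collapse a raw score in [0, 1] into one of three labels.

    The cutoff defaults are illustrative assumptions only.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in the range 0 to 1")
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    # Everything between the two cutoffs is reported without a number.
    return "unknown"
```

The end user only ever sees the returned label, so there's no number
around to be misread as a percentage.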

<rant>
People who are not statisticians tend to make a lot of silly
interpretations of numbers, particularly when those numbers are
percentages (or look like percentages).  If I tell people "I'm 75%
sure these dice are loaded", the vast majority of them will expect
the dice to roll particular values 75% of the time.  (Translation
to spambayes: for every message in some set of messages, a classifier
says it's 75% sure that the message is spam... and people think that
about 3/4 of those messages will be spam.  As a simple disproof,
consider the case where all the messages are identical: they are
either all spam or all ham, never 3/4 spam.)  People just don't grok
that surety has very little to do with distribution of results.
They also tend to go for all sorts of logical fallacies like a
statement implying its converse, excluded middles, etc.

The score we've got is just a number in the range 0 to 1 which has
interesting discriminatory properties.  It's not linear with any
concept of surety, and it's not linear with similarity to spam or
ham, either.  People not immersed in how it's generated and/or
buried in test results over decent sized corpora are sure (there's
that troubling word again) to misinterpret it.
</rant>
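
For anyone who wants the identical-messages disproof spelled out,
here's a toy sketch.  toy_score is a made-up stand-in for a
classifier (it is NOT how spambayes scores anything), and the batch
of messages is invented; the only point is that a constant score of
0.75 tells you nothing about what fraction of the batch is spam.

```python
def toy_score(message):
    # Hypothetical deterministic scorer: same text in, same score out.
    return 0.75 if "free" in message else 0.10


def spam_fraction(messages, is_spam):
    # Fraction of messages that the ground-truth predicate calls spam.
    return sum(is_spam(m) for m in messages) / len(messages)


# One hundred copies of the same (invented) message.
batch = ["free money now"] * 100

# Every copy gets exactly the same score.
scores = {toy_score(m) for m in batch}

# Identical messages share one true label, so only two outcomes exist.
fraction_if_spam = spam_fraction(batch, lambda m: True)
fraction_if_ham = spam_fraction(batch, lambda m: False)
```

Whichever way the truth falls, the fraction of spam in the batch is
all or nothing, while the score sits at 0.75 throughout.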

>But Sean's "sort on score" idea is also very useful. I think it'd speed 
>up the manual scanning/deletion process.

Having looked at the results from the show_unsure config option,
I tend to disagree... position in the list doesn't seem to have
any correlation with spam vs. ham.

- Alex