[spambayes-dev] patch to improve statistics from spambayes

Kenny Pitt kennypitt at hotmail.com
Fri Feb 27 10:47:06 EST 2004


Mark Moraes wrote:
> While I'm generally very happy with Spambayes, I was a bit
> confused by the statistics, which didn't seem to add up.

I think there are some good ideas here, but it looks like there are some
misunderstandings as well.  I'll see if I can clear those up a little.
I've often wondered if we couldn't produce some more useful statistics,
so maybe this is a good start to a discussion.

> 6 unsure good + 52 unsure spam  adds up to 58.  But the processed
> line says 63?  It's not clear how many messages were manually
> reviewed/trained.

This should indicate that there were 5 unsures that were not trained.  I
considered adding "and 5 were untrained" to the stats line.

> After using the Review web page to train and mark all 4 unsure as
> spam, 2 ham as spam and leaving all spam as-is (yay!), I see:
> 
> SpamBayes has processed 1150 messages - 754 (66%) good, 333 (29%)
> spam and 63 (5%) unsure. 333 messages were manually classified as
> good (0 were false positives). 414 messages were manually classified
> as spam (35 were false negatives). 6 unsure messages were manually
> identified as good, and 56 as spam. 
> 
> The false positive count is clearly a bug, since I just classified
> 2 ham as spam, and I know I've done that often.  But I've never
> had to classify spam as ham.  Looks like fp & fn are inverted.

A "positive" means that the message was classified as spam, and a
"negative" means that it was classified as ham.  A "false positive",
then, is a message that was classified as spam when it should have been
ham and a "false negative" is a message that was classified as ham when
it should have been spam.  Unsures are not counted.  If you've never had
to reclassify something from spam to ham then you've never had a false
positive, and the 2 messages that you had to reclassify as spam were
false negatives because they weren't detected.  It looks to me like the
original statistics are correct here.
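
Put another way, in rough pseudo-Python (the names here are purely
illustrative, not the actual Stats.py counters):

    # "positive" = classified as spam, "negative" = classified as ham
    if classified_as_spam and not actually_spam:
        fp += 1   # false positive: a ham wrongly flagged as spam
    elif not classified_as_spam and actually_spam:
        fn += 1   # false negative: a spam that slipped through as ham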

> The enclosed patch fixes that inversion, adds a few counters
> to tell which ham was manually identified as spam and vice
> versa, as well as total ham/spam/manually reviewed, so
> one can calculate percentages.

I'm not sure why more counters are necessary.  We already count the
number of false negatives (fn), which are messages classified as ham but
trained as spam; the number of unsures that were trained as spam
(trn_unsure_spam); and the total number trained as spam (trn_spam).  The
number of messages that were correctly classified as spam and were also
trained on is then (trn_spam - trn_unsure_spam - fn).  The same
calculation gives the ham side.
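
As a rough sketch of that arithmetic (fn, trn_unsure_spam and trn_spam
are existing counters; the ham-side names are just my guess at the
obvious mirror image):

    # correctly-classified messages that were also trained on
    correct_trained_spam = trn_spam - trn_unsure_spam - fn
    correct_trained_ham  = trn_ham  - trn_unsure_ham  - fp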

> ... (The calculation is conservative;
> false positives/manually-reviewed ham, or false
> negatives/manually-reviewed spam, 
> so that unreviewed messages don't skew the percentages)

Taking percentages only out of trained messages tells you something
about your training regimen, but nothing about the accuracy of the
filter.  Filter accuracy is the percentage of all messages received that
were correctly classified the first time.  The correct calculation for
accuracy should be:

total_correct = (cls_spam - fp) + (cls_ham - fn)
acc = 100.0 * (total_correct / total)

Knowing the percent incorrectly classified is useful as well.  Unsures
play into accuracy in an unusual way because some people consider them
"mistakes" and some don't.  Showing the % correct, the % incorrect, and
the % unsure accounts for that.
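
Something like this sketch would do it (cls_spam, cls_ham, fp, fn and
total are from the formula above; cls_unsure is just my assumed name for
the unsure count):

    total_correct   = (cls_spam - fp) + (cls_ham - fn)
    total_incorrect = fp + fn
    pct_correct   = 100.0 * total_correct / total
    pct_incorrect = 100.0 * total_incorrect / total
    pct_unsure    = 100.0 * cls_unsure / total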

> With the patch, Stats.py produces:
> Classified 1223 messages - 827 (68%) ham, 333 (27%) spam and 63 (5%)
> unsure. 
> Manually trained 760 messages:
> 340 of 375 ham messages manually confirmed (35 false positives 4.2%).
> 323 of 323 spam messages manually confirmed (0 false negatives 0.0%).
> Of 62 unsure messages, 6 (9.7%) manually identified as ham, 56
> (90.3%) as spam. 
> 
> I find this much more useful -- hope you agree.

I think it's a good start (with the exception of reversing the
definitions of false positives and false negatives <wink>).  Here's what
I've come up with for comparison (I've been playing with something
similar in the Outlook stats):

"""
SpamBayes has classified a total of 1223 messages:
    827 ham (67.6% of total)
    333 spam (27.2% of total)
    63 unsure (5.2% of total)

1125 messages were classified correctly (92.0% of total)
35 messages were classified incorrectly (2.9% of total)
    0 false positives (0.0% of total)
    35 false negatives (2.9% of total)

6 unsures trained as ham (9.5% of unsures)
56 unsures trained as spam (88.9% of unsures)
1 unsure was not trained (1.6% of unsures)

A total of 760 messages have been trained:
    346 ham (98.3% ham, 1.7% unsure, 0.0% false positives)
    414 spam (78.0% spam, 13.5% unsure, 8.5% false negatives)
"""

-- 
Kenny Pitt



