patch to improve statistics from spambayes
Hi. While I'm generally very happy with SpamBayes, I was a bit confused by the statistics, which didn't seem to add up. I'm using SpamBayes 1.0a9 (the web page says SpamBayes POP3 Proxy Version 0.4, February 2004) on Windows 2000 SP4, via the POP3 interface. I've tried a couple of different mail agents, including a command-line POP3 fetch, OE and Mozilla mail, and see similar results; the numbers below are from the command-line fetch. I have 'Lookup message in cache' set to yes, notate-to set for unsure, classify-subject set for spam, and I suppress caching of bulk ham.

After a POP3 fetch, the Statistics page says:

    SpamBayes has processed 1150 messages - 754 (66%) good, 333 (29%) spam
    and 63 (5%) unsure.
    324 messages were manually classified as good (0 were false positives).
    379 messages were manually classified as spam (33 were false negatives).
    6 unsure messages were manually identified as good, and 52 as spam.

1. 6 unsure good + 52 unsure spam adds up to 58, but the processed line says 63. It's not clear how many messages were manually reviewed/trained.

2. It's not clear that "manually classified as good" helps figure out what was accurately classified as good, because that count includes ham, spam and unsures that were so classified. Ditto for spam. It's not clear how the 324 manually classified as good relate to the 754 good, or how the 379 manually classified as spam relate to the 333 spam. As a result, it's hard to estimate accuracy.

3. After using the Review web page to train, marking all 4 unsures as spam and 2 hams as spam, and leaving all spam as-is (yay!), I see:

    SpamBayes has processed 1150 messages - 754 (66%) good, 333 (29%) spam
    and 63 (5%) unsure.
    333 messages were manually classified as good (0 were false positives).
    414 messages were manually classified as spam (35 were false negatives).
    6 unsure messages were manually identified as good, and 56 as spam.

The false positive count is clearly a bug, since I just classified 2 hams as spam, and I know I've done that often, but I've never had to classify spam as ham. Looks like fp & fn are inverted.

The enclosed patch fixes that inversion, adds a few counters to tell which ham was manually identified as spam and vice versa, as well as total ham/spam/manually-reviewed counts, so one can calculate percentages. (The calculation is conservative: false positives are taken as a fraction of manually-reviewed ham, and false negatives as a fraction of manually-reviewed spam, so that unreviewed messages don't skew the percentages.) I also trimmed the statements somewhat to avoid over-long lines (removed some verbs :-).

Before the patch, Stats.py produces:

    SpamBayes has processed 1223 messages - 827 (68%) good, 333 (27%) spam
    and 63 (5%) unsure.
    346 messages were manually classified as good (0 were false positives).
    414 messages were manually classified as spam (35 were false negatives).
    6 unsure messages were manually identified as good, and 56 as spam.

With the patch, Stats.py produces:

    Classified 1223 messages - 827 (68%) ham, 333 (27%) spam and 63 (5%) unsure.
    Manually trained 760 messages:
    340 of 375 ham messages manually confirmed (35 false positives 4.2%).
    323 of 323 spam messages manually confirmed (0 false negatives 0.0%).
    Of 62 unsure messages, 6 (9.7%) manually identified as ham,
    56 (90.3%) as spam.

I find this much more useful -- hope you agree.

Regards, Mark.
Mark Moraes wrote:
While I'm generally very happy with SpamBayes, I was a bit confused by the statistics, which didn't seem to add up.
I think there are some good ideas here, but it looks like there are some misunderstandings as well. I'll see if I can clear those up a little. I've often wondered if we couldn't produce some more useful statistics, so maybe this is a good start to a discussion.
6 unsure good + 52 unsure spam adds up to 58, but the processed line says 63. It's not clear how many messages were manually reviewed/trained.
This should indicate that there were 5 unsures that were not trained. I considered adding "and 5 were untrained" to the stats line.
After using the Review web page to train, marking all 4 unsures as spam and 2 hams as spam, and leaving all spam as-is (yay!), I see:
    SpamBayes has processed 1150 messages - 754 (66%) good, 333 (29%) spam
    and 63 (5%) unsure.
    333 messages were manually classified as good (0 were false positives).
    414 messages were manually classified as spam (35 were false negatives).
    6 unsure messages were manually identified as good, and 56 as spam.
The false positive count is clearly a bug, since I just classified 2 ham as spam, and I know I've done that often. But I've never had to classify spam as ham. Looks like fp & fn are inverted.
A "positive" means that the message was classified as spam, and a "negative" means that it was classified as ham. A "false positive", then, is a message that was classified as spam when it should have been ham and a "false negative" is a message that was classified as ham when it should have been spam. Unsures are not counted. If you've never had to reclassify something from spam to ham then you've never had a false positive, and the 2 messages that you had to reclassify as spam were false negatives because they weren't detected. It looks to me like the original statistics are correct here.
The enclosed patch fixes that inversion, adds a few counters to tell which ham was manually identified as spam and vice versa, as well as total ham/spam/manually-reviewed counts, so one can calculate percentages.
I'm not sure why more counters are necessary. We already count the number of false negatives (fn), which are hams that were trained as spam; the number of unsures that were trained as spam (trn_unsure_spam); and the total number trained as spam (trn_spam). The number of messages that were correctly classified as spam and were also trained on is then (trn_spam - trn_unsure_spam - fn). The same can be done to calculate the ham side.
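As a sketch of that arithmetic (the counter names are the ones mentioned above; only the wrapper function is hypothetical):

    def confirmed_counts(trn_ham, trn_unsure_ham, fp,
                         trn_spam, trn_unsure_spam, fn):
        # Messages trained as spam that were also classified as spam:
        # everything trained as spam, minus those that arrived as
        # unsure, minus the false negatives (classified ham, trained
        # spam).
        confirmed_spam = trn_spam - trn_unsure_spam - fn
        # Same idea on the ham side (fp = classified spam, trained ham).
        confirmed_ham = trn_ham - trn_unsure_ham - fp
        return confirmed_ham, confirmed_spam

    # With the numbers from Mark's second report (414 trained as spam,
    # 56 of them unsure, 35 false negatives): 414 - 56 - 35 == 323.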
... (The calculation is conservative: false positives are taken as a fraction of manually-reviewed ham, and false negatives as a fraction of manually-reviewed spam, so that unreviewed messages don't skew the percentages.)
Taking percentages only out of trained messages tells you something about your training regimen, but nothing about the accuracy of the filter. Filter accuracy is the percent of messages that were correctly classified the first time compared to all messages received. The correct calculation for accuracy should be:

    total_correct = (cls_spam - fp) + (cls_ham - fn)
    acc = 100.0 * (total_correct / total)

Knowing the percent incorrectly classified is useful as well. Unsures play into accuracy in an unusual way because some people consider them "mistakes" and some don't. Showing the % correct, the % incorrect, and the % unsure accounts for that.
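A runnable sketch of that calculation, using the counter names from the formula (the function itself is illustrative, not the actual Stats.py code):

    def accuracy_stats(cls_ham, cls_spam, cls_unsure, fp, fn):
        # cls_* are the classification counts; fp/fn come from the
        # user's corrections during training.
        total = cls_ham + cls_spam + cls_unsure
        total_correct = (cls_spam - fp) + (cls_ham - fn)
        total_incorrect = fp + fn
        return (100.0 * total_correct / total,
                100.0 * total_incorrect / total,
                100.0 * cls_unsure / total)

    # With the totals from the sample report below (827 ham, 333 spam,
    # 63 unsure, 0 false positives, 35 false negatives):
    print("%.1f%% correct, %.1f%% incorrect, %.1f%% unsure"
          % accuracy_stats(827, 333, 63, 0, 35))
    # -> 92.0% correct, 2.9% incorrect, 5.2% unsure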
With the patch, Stats.py produces:

    Classified 1223 messages - 827 (68%) ham, 333 (27%) spam and 63 (5%) unsure.
    Manually trained 760 messages:
    340 of 375 ham messages manually confirmed (35 false positives 4.2%).
    323 of 323 spam messages manually confirmed (0 false negatives 0.0%).
    Of 62 unsure messages, 6 (9.7%) manually identified as ham,
    56 (90.3%) as spam.
I find this much more useful -- hope you agree.
I think it's a good start (with the exception of reversing the definitions of false positives and false negatives <wink>). Here's what I've come up with for comparison (I've been playing with something similar in the Outlook stats):

"""
SpamBayes has classified a total of 1223 messages:
    827 ham (67.6% of total)
    333 spam (27.2% of total)
    63 unsure (5.2% of total)
1125 messages were classified correctly (92.0% of total)
35 messages were classified incorrectly (2.9% of total)
    0 false positives (0.0% of total)
    35 false negatives (2.9% of total)
6 unsures trained as ham (9.5% of unsures)
56 unsures trained as spam (88.9% of unsures)
1 unsure was not trained (1.6% of unsures)
A total of 760 messages have been trained:
    346 ham (98.3% ham, 1.7% unsure, 0.0% false positives)
    414 spam (78.0% spam, 13.5% unsure, 8.5% false negatives)
"""

-- Kenny Pitt
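A quick sanity check of the trained-message percentages in that report; a throwaway sketch, with counter names following Kenny's earlier message:

    # Verify the trained-message breakdown from the raw counters.
    trn_ham, trn_unsure_ham, fp = 346, 6, 0
    trn_spam, trn_unsure_spam, fn = 414, 56, 35

    assert trn_ham + trn_spam == 760    # "A total of 760 messages..."
    print("%.1f%% ham" % (100.0 * (trn_ham - trn_unsure_ham - fp) / trn_ham))
    # -> 98.3% ham, matching "346 ham (98.3% ham, ...)"
    print("%.1f%% spam" % (100.0 * (trn_spam - trn_unsure_spam - fn) / trn_spam))
    # -> 78.0% spam, matching "414 spam (78.0% spam, ...)"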
[Kenny Pitt]
"""
SpamBayes has classified a total of 1223 messages:
    827 ham (67.6% of total)
    333 spam (27.2% of total)
    63 unsure (5.2% of total)
1125 messages were classified correctly (92.0% of total)
35 messages were classified incorrectly (2.9% of total)
    0 false positives (0.0% of total)
    35 false negatives (2.9% of total)
6 unsures trained as ham (9.5% of unsures)
56 unsures trained as spam (88.9% of unsures)
1 unsure was not trained (1.6% of unsures)
A total of 760 messages have been trained:
    346 ham (98.3% ham, 1.7% unsure, 0.0% false positives)
    414 spam (78.0% spam, 13.5% unsure, 8.5% false negatives)
"""
That looks very useful, concise and complete. -- Seth Goodman
Based on Kenny Pitt's suggestions, I revised my statistics patch (enclosed is the revised patch, relative to 1.0a9, now that I understand the definition of false positive :-).

One assumption of this form of calculation is worth noting: unreviewed/untrained messages are presumed to have been classified correctly (otherwise, presumably, the user would have trained on them). That seems reasonable enough to me, but it is worth keeping in mind; a small illustration follows the sample output below. (Also, anyone who cares about looking at the statistics presumably cares enough to review/train often!)

Regards, Mark.

Kenny Pitt wrote:
Mark Moraes wrote:
... (The calculation is conservative: false positives are taken as a fraction of manually-reviewed ham, and false negatives as a fraction of manually-reviewed spam, so that unreviewed messages don't skew the percentages.)
Taking percentages only out of trained messages tells you something about your training regimen, but nothing about the accuracy of the filter. Filter accuracy is the percent of messages that were correctly classified the first time compared to all messages received. The correct calculation for accuracy should be:
total_correct = (cls_spam - fp) + (cls_ham - fn)
acc = 100.0 * (total_correct / total)
SpamBayes has classified a total of 1223 messages:
    827 ham (67.6% of total)
    333 spam (27.2% of total)
    63 unsure (5.2% of total)
1125 messages were classified correctly (92.0% of total)
35 messages were classified incorrectly (2.9% of total)
    0 false positives (0.0% of total)
    35 false negatives (2.9% of total)
---

Sample of current output:

SpamBayes has classified a total of 1671 messages:
    1139 ham (68.2% of total)
    452 spam (27.0% of total)
    80 unsure (4.8% of total)
1555 classified correctly (93.1% of total)
36 classified incorrectly (2.2% of total)
    0 incorrectly identified as spam (false positive 0.0% of the total)
    36 incorrectly identified as ham (false negative 2.2% of the total)
6 unsures trained as ham (7.5% of unsures)
73 unsures trained as spam (91.3% of unsures)
1 unsure was not trained (1.3% of unsures)
A total of 943 messages have been trained:
    393 ham (98.5% ham, 1.5% unsure, 0.0% false positives)
    550 spam (80.2% spam, 0.0% unsure, 6.5% false negatives)
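Regarding the assumption Mark notes above (untrained messages count as correct), here is a throwaway sketch, using numbers from earlier in the thread, showing why unreviewed mistakes never reach the accuracy figure:

    # Untrained messages are counted as correct: fp/fn only grow when
    # the user retrains a message, so mistakes the user never notices
    # never show up in the accuracy figure. Throwaway numbers from
    # earlier in the thread:
    cls_ham, cls_spam, cls_unsure = 827, 333, 63
    fp, fn = 0, 35
    total = cls_ham + cls_spam + cls_unsure
    acc = 100.0 * ((cls_spam - fp) + (cls_ham - fn)) / total
    print("%.1f%%" % acc)   # -> 92.0%
    # If, say, 10 of the unreviewed "ham" were really spam the user
    # never noticed, the true accuracy would be lower, but fp/fn (and
    # therefore acc) would not change.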