I've finally managed to get something working with the Outlook addin and Skip's cool new ocrad stuff. The results look promising! :)

A summary of my results is below. The runs are 'Tokenizer:x-image_size' and 'Tokenizer:x-crack_images' both set to False, versus both set to True. It looks like a 13% improvement in false negatives, which is nothing to sneeze at!

I've never been an expert at reading these results though, so let me know if there is anything interesting I missed or neglected to send.

Cheers,

Mark

false positive percentages
<snip 10 lines of zeros>

won   0 times
tied 10 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    5.859  5.455  won     -6.90%
    5.410  4.363  won    -19.35%
    5.794  5.234  won     -9.67%
    3.008  2.820  won     -6.25%
    5.588  4.817  won    -13.80%
    4.469  3.166  won    -29.16%
    5.051  4.646  won     -8.02%
    5.829  5.647  won     -3.12%
    6.842  5.789  won    -15.39%
    7.356  5.964  won    -18.92%

won  10 times
tied  0 times
lost  0 times

total unique fn went from 293 to 254 won -13.31%
mean fn % went from 5.52048164035 to 4.79002154629 won -13.23%

ham mean                     ham sdev
   0.00    0.00  +(was 0)       0.05    0.05   +0.00%
   0.03    0.04   +33.33%       0.48    0.70  +45.83%
   0.06    0.04   -33.33%       0.82    0.53  -35.37%
   0.08    0.08    +0.00%       1.53    1.53   +0.00%
   0.00    0.00  +(was 0)       0.01    0.01   +0.00%
   0.08    0.09   +12.50%       1.89    1.90   +0.53%
   0.00    0.00  +(was 0)       0.08    0.08   +0.00%
   0.00    0.00  +(was 0)       0.08    0.07  -12.50%
   0.01    0.01    +0.00%       0.17    0.17   +0.00%
   0.01    0.01    +0.00%       0.21    0.18  -14.29%

ham mean and sdev for all runs
   0.03    0.03    +0.00%       0.83    0.82   -1.20%

spam mean                    spam sdev
  90.19   90.67    +0.53%      24.58   23.81   -3.13%
  91.43   91.74    +0.34%      22.99   22.24   -3.26%
  91.51   91.85    +0.37%      23.66   22.79   -3.68%
  93.62   93.94    +0.34%      18.73   17.90   -4.43%
  90.62   91.11    +0.54%      23.83   22.99   -3.52%
  91.07   91.66    +0.65%      22.71   21.31   -6.16%
  90.85   91.31    +0.51%      23.28   22.58   -3.01%
  89.74   90.19    +0.50%      25.29   24.57   -2.85%
  89.49   90.15    +0.74%      25.99   24.73   -4.85%
  88.57   89.45    +0.99%      26.84   25.00   -6.86%

spam mean and sdev for all runs
  90.72   91.21    +0.54%      23.91   22.90   -4.22%

ham/spam mean difference: 90.69 91.18 +0.49
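
For anyone else who finds these tables hard to read: the percentage columns appear to be plain relative changes from the first run to the second, with "+(was 0)" shown when the first value is zero. The sketch below is not taken from cmp.py; `pct_change` is a hypothetical helper, and the formula is an assumption that happens to reproduce the figures quoted above.

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, formatted like the columns above.

    Hypothetical helper for illustration only; not the code cmp.py uses.
    """
    if before == 0:
        # A relative change from zero is undefined, so flag it instead.
        return "+(was 0)"
    return "%+.2f%%" % ((after - before) * 100.0 / before)


# Total unique false negatives went from 293 to 254.
print(pct_change(293, 254))
# Mean false negative percentage across all runs.
print(pct_change(5.52048164035, 4.79002154629))
```

A negative value in the false negative section means fewer spam messages were missed, which is why those rows are marked "won".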
I wrote:
> I've finally managed to get something working with the Outlook addin and Skip's cool new ocrad stuff. The results look promising! :)

Here are a few more details on what I am doing. To make things work with the image cracking code, I took the route of having the Outlook addin generate a valid multipart message when there are images. If there are no images, we return the same as we did in the past (i.e., a singlepart message with text and HTML in the normal "body"), so where possible, the tokens generated for a message will be the same.

When there are images, the tokens will now be different - due to the extra image cracking tokens (obviously), but also due to the different MIME-related tokens that will now be seen by the standard tokenizer. This is a fairly subtle change, but could be significant to the classifier.

For the purposes of comparison, I exported all ham and spam using the "old" scheme (i.e., before images were handled), and with the new scheme but with image options disabled (but importantly, the new scheme *does* include the image data). The idea is to test only the impact of the new MIME structure without looking at image content.

I *think* these results are OK, but they are a little strange. Below is the result of cmp.py comparing the "old" scheme with the "new" scheme - note we won 6 times, lost 4 times, and never tied, with the best win by 29%, but the worst loss by 25%. Another value of "+900.00%" in "ham sdev" also appears extreme (it is a move from 0.01 to 0.10), but as I mentioned, I'm not very good at reading these.

One thing I noticed is that the fact that a message has a .gif attached is now a significant spam clue - I expect those new tokens account for the significant swings in the results.

Does anyone have comments about this?

Cheers, and Happy Holidays!

Mark

<snip false positive percentages - all zero>

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    6.897  4.906  won    -28.87%
    6.777  6.367  won     -6.05%
    5.206  6.526  lost   +25.36%
    5.323  6.655  lost   +25.02%
    6.397  6.430  lost    +0.52%
    6.391  4.727  won    -26.04%
    5.587  5.204  won     -6.86%
    5.415  5.769  lost    +6.54%
    6.186  5.495  won    -11.17%
    6.470  6.239  won     -3.57%

won   6 times
tied  0 times
lost  4 times

total unique fn went from 331 to 319 won -3.63%
mean fn % went from 6.06478720218 to 5.8317685488 won -3.84%

ham mean                     ham sdev
   0.04    0.01   -75.00%       0.53    0.11  -79.25%
   0.01    0.01    +0.00%       0.12    0.13   +8.33%
   0.00    0.00  +(was 0)       0.01    0.10 +900.00%
   0.00    0.00  +(was 0)       0.02    0.08 +300.00%
   0.00    0.03  +(was 0)       0.00    0.66 +(was 0)
   0.00    0.00  +(was 0)       0.06    0.01  -83.33%
   0.05    0.02   -60.00%       1.05    0.28  -73.33%
   0.10    0.03   -70.00%       1.67    0.47  -71.86%
   0.01    0.05  +400.00%       0.14    0.87 +521.43%
   0.02    0.08  +300.00%       0.29    1.36 +368.97%

ham mean and sdev for all runs
   0.02    0.02    +0.00%       0.66    0.59  -10.61%

spam mean                    spam sdev
  89.52   91.50    +2.21%      25.85   23.32   -9.79%
  88.98   89.37    +0.44%      25.99   25.72   -1.04%
  91.25   89.82    -1.57%      23.36   25.43   +8.86%
  90.74   89.59    -1.27%      23.72   25.61   +7.97%
  89.76   89.78    +0.02%      26.01   25.75   -1.00%
  89.98   90.99    +1.12%      25.19   23.12   -8.22%
  90.89   89.93    -1.06%      23.97   24.20   +0.96%
  91.34   90.33    -1.11%      23.41   24.35   +4.02%
  89.88   90.23    +0.39%      25.39   24.58   -3.19%
  88.73   90.43    +1.92%      26.01   25.24   -2.96%

spam mean and sdev for all runs
  90.11   90.19    +0.09%      24.94   24.77   -0.68%

ham/spam mean difference: 90.09 90.17 +0.08
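
To make the singlepart versus multipart distinction in this message concrete, here is a minimal sketch using the modern standard-library email package. It is not the Outlook addin's code: `build_message` is a hypothetical name, and the multipart/mixed layout with image attachments is an assumption about what a "valid multipart message" might look like. The point is only that adding an image changes the MIME structure the tokenizer sees, independently of the image content.

```python
from email.message import EmailMessage


def build_message(body, images):
    """Build a message from body text and a list of images.

    `images` is a list of (filename, subtype, raw_bytes) tuples.
    Hypothetical helper for illustration; the layout is an assumption.
    """
    msg = EmailMessage()
    # With no images this stays a singlepart text/plain message.
    msg.set_content(body)
    for filename, subtype, data in images:
        # The first attachment converts the message to multipart/mixed,
        # adding new Content-Type and filename headers for each image.
        msg.add_attachment(data, maintype="image", subtype=subtype,
                           filename=filename)
    return msg


plain = build_message("hello", [])
with_image = build_message("hello", [("a.gif", "gif", b"GIF89a")])
print(plain.get_content_type())
print(with_image.get_content_type())
print([part.get_content_type() for part in with_image.iter_parts()])
```

Even with image cracking disabled, the second message carries extra MIME headers (content type, attachment filename) that the first does not, which would be consistent with a .gif attachment showing up as a new clue.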

participants (1): Mark Hammond