[Spambayes] On counting words more than once
Tim Peters
tim.one@comcast.net
Sun, 29 Sep 2002 01:09:42 -0400
[Neil Schemenauer, tries
[Classifier]
count_duplicates_only_once_in_training: True
]
> It's a win for me:
>
> false positive percentages
> 0.000 0.000 tied
> 1.000 1.000 tied
> 0.000 0.000 tied
> 0.000 0.000 tied
> 0.500 0.500 tied
> 0.500 1.000 lost +100.00%
> 0.500 0.500 tied
> 0.000 0.000 tied
> 0.500 0.500 tied
> 0.000 0.500 lost +(was 0)
>
> won 0 times
> tied 8 times
> lost 2 times
>
> total unique fp went from 6 to 8 lost +33.33%
> mean fp % went from 0.3 to 0.4 lost +33.33%
>
> false negative percentages
> 0.000 0.000 tied
> 1.000 0.500 won -50.00%
> 1.000 0.500 won -50.00%
> 0.500 0.500 tied
> 2.000 1.500 won -25.00%
> 1.500 1.000 won -33.33%
> 0.000 0.000 tied
> 0.500 0.000 won -100.00%
> 0.500 0.000 won -100.00%
> 0.000 0.000 tied
>
> won 6 times
> tied 4 times
> lost 0 times
>
> total unique fn went from 14 to 8 won -42.86%
> mean fn % went from 0.7 to 0.4 won -42.86%
>
> ham mean                  ham sdev
> 30.01  27.92  -6.96%      8.43  8.42  -0.12%
> 28.50  26.74  -6.18%      8.83  8.69  -1.59%
> 27.93  26.04  -6.77%      8.20  7.94  -3.17%
> 29.55  27.33  -7.51%      8.24  8.23  -0.12%
> 29.05  27.19  -6.40%      8.28  8.15  -1.57%
> 31.40  29.48  -6.11%      9.41  9.25  -1.70%
> 29.31  27.49  -6.21%      8.13  8.10  -0.37%
> 29.33  27.16  -7.40%      7.86  7.89  +0.38%
> 28.72  27.22  -5.22%      9.05  8.97  -0.88%
> 29.04  26.87  -7.47%      7.28  7.22  -0.82%
>
> ham mean and sdev for all runs
> 29.28 27.34 -6.63% 8.44 8.35 -1.07%
>
> spam mean                 spam sdev
> 82.98  81.91  -1.29%       9.83  10.16  +3.36%
> 82.02  81.04  -1.19%       9.92  10.09  +1.71%
> 81.19  80.28  -1.12%       9.69   9.86  +1.75%
> 82.51  81.66  -1.03%       9.92  10.23  +3.13%
> 82.60  81.60  -1.21%      10.12  10.33  +2.08%
> 82.24  81.36  -1.07%       9.25   9.71  +4.97%
> 81.74  80.85  -1.09%       9.30   9.49  +2.04%
> 81.70  80.64  -1.30%       9.51   9.81  +3.15%
> 82.39  81.45  -1.14%       9.87  10.18  +3.14%
> 82.44  81.45  -1.20%       9.49   9.73  +2.53%
>
> spam mean and sdev for all runs
> 82.18 81.22 -1.17% 9.71 9.97 +2.68%
>
> ham/spam mean difference: 52.90 53.88 +0.98
So all the same things I saw: ham and spam means decrease, overall mean
spread increases, ham variance is a mixed bag, and spam variance increases.

Anyone else? Based on Neil's and my results (which are all we have), we
should make this change and get rid of the option.

Ponder. I conjecture there's a set of words that are common in both ham and
spam, but that tend to be repeated more often within a spam than within a
ham. The multiple-count gimmick would then give them larger spamprobs during
training. Taking that bias away thus stops penalizing ham for using them
too (mean decreases), but makes spam fuzzier (variance increases).

It also suggests there's *some* useful info to be had about how often a word
appears in a msg, but that adjusting spamprob isn't the right way to exploit
it. If Gary is dying for something to think about <wink>, is there a simple
way to model a multinomial distribution in this framework?
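
A toy sketch of the conjecture about duplicate counting, for anyone who wants to poke at it. This is not the classifier's code: the function names, the made-up messages, and the bare ratio used for spamprob are all assumptions for illustration. It only shows how a word used once per ham but repeated within each spam gets a larger spamprob when every occurrence is counted in training.

```python
from collections import Counter

def train_counts(messages, count_duplicates_only_once):
    """Count word occurrences over tokenized messages (hypothetical helper)."""
    counts = Counter()
    for words in messages:
        if count_duplicates_only_once:
            words = set(words)  # each word counts at most once per message
        counts.update(words)
    return counts

def toy_spamprob(word, ham_counts, spam_counts, nham, nspam):
    """Simplified ratio-style spamprob; an assumption, not the project's formula."""
    hamratio = ham_counts[word] / nham
    spamratio = spam_counts[word] / nspam
    total = hamratio + spamratio
    return 0.5 if total == 0 else spamratio / total

# Made-up data: "free" shows up in every message, but is repeated only in spam.
ham = [["free", "lunch", "meeting"], ["free", "time", "tomorrow"]]
spam = [["free", "free", "free", "money"], ["free", "free", "offer"]]

def spamprob_for(word, once):
    h = train_counts(ham, once)
    s = train_counts(spam, once)
    return toy_spamprob(word, h, s, len(ham), len(spam))

p_multi = spamprob_for("free", once=False)  # every occurrence counted
p_once = spamprob_for("free", once=True)    # duplicates counted once

print(f"counting every occurrence: {p_multi:.3f}")
print(f"counting once per message: {p_once:.3f}")
```

With this data the word comes out neutral when counted once per message and leans spammy when every occurrence is counted, which is the bias the option takes away.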