[Spambayes] On counting words more than once

Tim Peters tim.one@comcast.net
Sun, 29 Sep 2002 01:09:42 -0400


[Neil Schemenauer, tries
   [Classifier]
   count_duplicates_only_once_in_training: True
]
> It's a win for me:
>
>     false positive percentages
>         0.000  0.000  tied
>         1.000  1.000  tied
>         0.000  0.000  tied
>         0.000  0.000  tied
>         0.500  0.500  tied
>         0.500  1.000  lost  +100.00%
>         0.500  0.500  tied
>         0.000  0.000  tied
>         0.500  0.500  tied
>         0.000  0.500  lost  +(was 0)
>
>     won   0 times
>     tied  8 times
>     lost  2 times
>
>     total unique fp went from 6 to 8 lost   +33.33%
>     mean fp % went from 0.3 to 0.4 lost   +33.33%
>
>     false negative percentages
>         0.000  0.000  tied
>         1.000  0.500  won    -50.00%
>         1.000  0.500  won    -50.00%
>         0.500  0.500  tied
>         2.000  1.500  won    -25.00%
>         1.500  1.000  won    -33.33%
>         0.000  0.000  tied
>         0.500  0.000  won   -100.00%
>         0.500  0.000  won   -100.00%
>         0.000  0.000  tied
>
>     won   6 times
>     tied  4 times
>     lost  0 times
>
>     total unique fn went from 14 to 8 won    -42.86%
>     mean fn % went from 0.7 to 0.4 won    -42.86%
>
>     ham mean                     ham sdev
>       30.01   27.92   -6.96%        8.43    8.42   -0.12%
>       28.50   26.74   -6.18%        8.83    8.69   -1.59%
>       27.93   26.04   -6.77%        8.20    7.94   -3.17%
>       29.55   27.33   -7.51%        8.24    8.23   -0.12%
>       29.05   27.19   -6.40%        8.28    8.15   -1.57%
>       31.40   29.48   -6.11%        9.41    9.25   -1.70%
>       29.31   27.49   -6.21%        8.13    8.10   -0.37%
>       29.33   27.16   -7.40%        7.86    7.89   +0.38%
>       28.72   27.22   -5.22%        9.05    8.97   -0.88%
>       29.04   26.87   -7.47%        7.28    7.22   -0.82%
>
>     ham mean and sdev for all runs
>       29.28   27.34   -6.63%        8.44    8.35   -1.07%
>
>     spam mean                    spam sdev
>       82.98   81.91   -1.29%        9.83   10.16   +3.36%
>       82.02   81.04   -1.19%        9.92   10.09   +1.71%
>       81.19   80.28   -1.12%        9.69    9.86   +1.75%
>       82.51   81.66   -1.03%        9.92   10.23   +3.13%
>       82.60   81.60   -1.21%       10.12   10.33   +2.08%
>       82.24   81.36   -1.07%        9.25    9.71   +4.97%
>       81.74   80.85   -1.09%        9.30    9.49   +2.04%
>       81.70   80.64   -1.30%        9.51    9.81   +3.15%
>       82.39   81.45   -1.14%        9.87   10.18   +3.14%
>       82.44   81.45   -1.20%        9.49    9.73   +2.53%
>
>     spam mean and sdev for all runs
>       82.18   81.22   -1.17%        9.71    9.97   +2.68%
>
>     ham/spam mean difference: 52.90 53.88 +0.98

So all the same things I saw:  ham and spam means decrease, overall mean
spread increases, ham variance is a mixed bag, and spam variance increases.

Anyone else?  Based on Neil's and my results (which are all we have), we
should make this change and get rid of the option.

Ponder.  I conjecture there's a set of words that are common in both ham and
spam, but are more likely to appear more often in spam than in ham.  The
multiple-count gimmick would then give them larger spamprobs during
training.  Taking that bias away thus stops penalizing ham for using them
too (mean decreases), but makes spam fuzzier (variance increases).
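For concreteness, here's a minimal sketch of the two training policies being
compared.  This is not the actual Classifier code -- the function name and
flag are made up for illustration -- but it shows the difference: counting
duplicates once means a word repeated in a message still bumps its ham/spam
count by at most 1.

```python
from collections import Counter

def training_counts(tokens, count_duplicates_once=True):
    """Per-message contribution to a word's training count.

    count_duplicates_once=True: each distinct token contributes 1,
    no matter how often it repeats in the message.
    count_duplicates_once=False: every occurrence contributes, so
    words that repeat more often in spam pick up inflated spamprobs.
    """
    if count_duplicates_once:
        return Counter(set(tokens))
    return Counter(tokens)
```

Under the old policy, a word like "free" appearing five times in one spam
adds 5 to its spam count; under the new one it adds 1, so the trained
spamprob reflects only how many *messages* contain the word.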

It also suggests there's *some* useful info to be had about how often a word
appears in a msg, but that adjusting spamprob isn't the right way to exploit
it.  If Gary is dying for something to think about <wink>, is there a simple
way to model a multinomial distribution in this framework?
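One textbook possibility -- offered only as a sketch of the standard
multinomial model, not a claim about what would work in this framework --
is to keep the trained per-word probabilities based on message counts, and
let repetition enter at *scoring* time instead: under a multinomial
likelihood, a word occurring n times in a message contributes n times its
log-odds.

```python
import math

def multinomial_log_odds(counts, p_word_given_spam, p_word_given_ham):
    """Log-odds of spam vs ham under a multinomial likelihood.

    counts: {word: occurrences in this message}.  Each of the n
    occurrences multiplies the class likelihood by p(word | class),
    so repeats scale the word's evidence linearly in the log domain
    instead of distorting the trained probabilities themselves.
    """
    return sum(n * (math.log(p_word_given_spam[w]) -
                    math.log(p_word_given_ham[w]))
               for w, n in counts.items())
```

Whether that evidence should really grow linearly with n (rather than
saturating) is exactly the kind of question the results above raise.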