[Spambayes] Here's why "generate_long_skips: False" worked...

Neil Schemenauer nas@python.ca
Mon, 30 Sep 2002 20:21:00 -0700


Tim Peters wrote:
> [Neil Schemenauer]
> > I tried generating character 2-grams when has_highbit_char was true.
> 
> In addition to, or in lieu of, generating skip tokens?

In addition.

> 1. Current vs doing character 2-grams when has_highbit_char is true
>    instead of generating skip tokens.
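
For readers who want the two variants spelled out, here is a minimal sketch of what "character 2-grams instead of / in addition to skip tokens" could look like. It is not the actual tokenizer code: the has_highbit_char definition, the "2g:" prefix, the skip-token format, and the tokenize_long_word helper with its also_skip flag are all assumptions made for illustration.

```python
import re

# Assumption: a word "has a high-bit char" if any character falls in
# the range 0x80-0xff. The real tokenizer's test may differ.
has_highbit_char = re.compile(r"[\x80-\xff]").search


def char_2grams(word):
    """Yield every overlapping 2-character slice of word.

    The "2g:" prefix is hypothetical; it just keeps these tokens from
    colliding with ordinary word tokens.
    """
    for i in range(len(word) - 1):
        yield "2g:" + word[i:i + 2]


def tokenize_long_word(word, also_skip=False):
    """Hypothetical handling of a word too long to keep as a token.

    also_skip=False: 2-grams *instead of* a skip token (experiment 1).
    also_skip=True:  2-grams *in addition to* a skip token (experiment 2).
    Words without high-bit characters always get just the skip token.
    """
    if has_highbit_char(word):
        for tok in char_2grams(word):
            yield tok
        if not also_skip:
            return
    # Assumed skip-token shape: first character plus length rounded
    # down to a multiple of 10.
    yield "skip:%c %d" % (word[0], len(word) // 10 * 10)
```

With also_skip=False a high-bit word yields only its 2-grams; with also_skip=True the same word yields the 2-grams followed by one skip token.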

Left column is the current code, right column is with the change:

    false positive percentages
        0.000  0.000  tied          
        1.000  1.000  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.500  0.500  tied          
        0.500  0.500  tied          
        0.000  0.000  tied          
        0.500  0.500  tied          
        0.000  0.000  tied          

    won   0 times
    tied 10 times
    lost  0 times

    total unique fp went from 5 to 5 tied          
    mean fp % went from 0.25 to 0.25 tied          

    false negative percentages
        0.000  0.000  tied          
        1.000  1.000  tied          
        1.000  1.000  tied          
        0.500  0.500  tied          
        1.500  1.500  tied          
        1.500  1.500  tied          
        0.500  0.500  tied          
        0.500  0.500  tied          
        1.000  1.000  tied          
        0.000  0.000  tied          

    won   0 times
    tied 10 times
    lost  0 times

    total unique fn went from 15 to 15 tied          
    mean fn % went from 0.75 to 0.75 tied          

    ham mean                     ham sdev
      27.66   27.62   -0.14%        8.52    8.51   -0.12%
      26.51   26.47   -0.15%        8.75    8.79   +0.46%
      25.82   25.76   -0.23%        7.92    7.91   -0.13%
      27.03   27.00   -0.11%        8.22    8.28   +0.73%
      26.95   26.88   -0.26%        8.21    8.26   +0.61%
      29.23   29.19   -0.14%        9.28    9.27   -0.11%
      27.25   27.20   -0.18%        8.15    8.16   +0.12%
      26.89   26.83   -0.22%        7.88    7.89   +0.13%
      27.02   26.93   -0.33%        9.02    8.99   -0.33%
      26.63   26.57   -0.23%        7.20    7.18   -0.28%

    ham mean and sdev for all runs
      27.10   27.05   -0.18%        8.38    8.39   +0.12%

    spam mean                    spam sdev
      81.73   82.38   +0.80%       10.24   10.96   +7.03%
      80.90   81.56   +0.82%       10.16   10.96   +7.87%
      80.03   81.11   +1.35%        9.99   11.02  +10.31%
      81.51   82.48   +1.19%       10.28   11.29   +9.82%
      81.44   82.31   +1.07%       10.43   11.13   +6.71%
      81.11   82.17   +1.31%        9.82   10.87  +10.69%
      80.64   81.69   +1.30%        9.52   10.47   +9.98%
      80.43   81.48   +1.31%        9.84   10.74   +9.15%
      81.18   82.02   +1.03%       10.25   10.91   +6.44%
      81.17   82.59   +1.75%        9.90   11.10  +12.12%

    spam mean and sdev for all runs
      81.01   81.98   +1.20%       10.06   10.96   +8.95%

    ham/spam mean difference: 53.91 54.93 +1.02
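
As a sanity check on how to read the summary lines, the sketch below recomputes them from the rounded "all runs" figures printed above. The pct_change helper is a hypothetical name, and the comparison script may work from unrounded values, so small last-digit differences are possible on other rows.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# "All runs" means as printed in the output above (rounded values).
ham_before, ham_after = 27.10, 27.05
spam_before, spam_after = 81.01, 81.98

# Ham/spam mean difference is the spam mean minus the ham mean; a larger
# gap means the two populations are further apart.
diff_before = round(spam_before - ham_before, 2)
diff_after = round(spam_after - ham_after, 2)
gain = round(diff_after - diff_before, 2)

ham_pct = round(pct_change(ham_before, ham_after), 2)
spam_pct = round(pct_change(spam_before, spam_after), 2)

print(diff_before, diff_after, gain)
print(ham_pct, spam_pct)
```

The separation grows by about one point, but the spam sdev also grows by roughly 9%, and the fp/fn counts do not move at all.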

> 
> 2. Current vs doing character 2-grams when has_highbit_char is true
>    in addition to generating skip tokens.

Again, left column is the current code, right column is with the change:

    false positive percentages
        0.000  0.000  tied          
        1.000  1.000  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.500  0.500  tied          
        0.500  0.500  tied          
        0.000  0.000  tied          
        0.500  0.500  tied          
        0.000  0.000  tied          

    won   0 times
    tied 10 times
    lost  0 times

    total unique fp went from 5 to 5 tied          
    mean fp % went from 0.25 to 0.25 tied          

    false negative percentages
        0.000  0.000  tied          
        1.000  1.000  tied          
        1.000  1.000  tied          
        0.500  0.500  tied          
        1.500  1.500  tied          
        1.500  1.500  tied          
        0.500  0.500  tied          
        0.500  0.500  tied          
        1.000  1.000  tied          
        0.000  0.000  tied          

    won   0 times
    tied 10 times
    lost  0 times

    total unique fn went from 15 to 15 tied          
    mean fn % went from 0.75 to 0.75 tied          

    ham mean                     ham sdev
      27.66   27.66   +0.00%        8.52    8.52   +0.00%
      26.51   26.52   +0.04%        8.75    8.79   +0.46%
      25.82   25.82   +0.00%        7.92    7.92   +0.00%
      27.03   27.06   +0.11%        8.22    8.28   +0.73%
      26.95   26.96   +0.04%        8.21    8.25   +0.49%
      29.23   29.23   +0.00%        9.28    9.28   +0.00%
      27.25   27.26   +0.04%        8.15    8.16   +0.12%
      26.89   26.89   +0.00%        7.88    7.88   +0.00%
      27.02   27.02   +0.00%        9.02    9.02   +0.00%
      26.63   26.63   +0.00%        7.20    7.20   +0.00%

    ham mean and sdev for all runs
      27.10   27.10   +0.00%        8.38    8.39   +0.12%

    spam mean                    spam sdev
      81.73   82.51   +0.95%       10.24   11.00   +7.42%
      80.90   81.66   +0.94%       10.16   10.98   +8.07%
      80.03   81.24   +1.51%        9.99   11.18  +11.91%
      81.51   82.58   +1.31%       10.28   11.35  +10.41%
      81.44   82.38   +1.15%       10.43   11.17   +7.09%
      81.11   82.29   +1.45%        9.82   10.91  +11.10%
      80.64   81.78   +1.41%        9.52   10.48  +10.08%
      80.43   81.57   +1.42%        9.84   10.80   +9.76%
      81.18   82.13   +1.17%       10.25   10.96   +6.93%
      81.17   82.71   +1.90%        9.90   11.22  +13.33%

    spam mean and sdev for all runs
      81.01   82.09   +1.33%       10.06   11.02   +9.54%

    ham/spam mean difference: 53.91 54.99 +1.08