[Spambayes] new option: generate_long_skips

Skip Montanaro skip@pobox.com
Mon, 30 Sep 2002 17:43:38 -0500


    > I just checked in a new option for the tokenizer: generate_long_skips.
    ...
    > I am currently running a test with 10 sets of 200 messages per set.

Almost exactly the same:

    cutoffs -> noskipss
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams

    false positive percentages
        1.000  1.000  tied          
        1.500  1.500  tied          
        1.000  1.000  tied          
        1.000  1.000  tied          
        1.000  1.000  tied          
        1.500  1.000  won    -33.33%
        3.500  3.500  tied          
        1.500  1.500  tied          
        1.500  1.500  tied          
        1.500  1.500  tied          

    won   1 times
    tied  9 times
    lost  0 times

    total unique fp went from 30 to 29 won     -3.33%
    mean fp % went from 1.5 to 1.45 won     -3.33%

    false negative percentages
        0.500  0.500  tied          
        1.500  1.500  tied          
        0.500  0.500  tied          
        0.500  1.000  lost  +100.00%
        2.000  2.000  tied          
        0.000  0.000  tied          
        1.000  1.000  tied          
        1.000  1.000  tied          
        0.000  0.000  tied          
        1.500  1.500  tied          

    won   0 times
    tied  9 times
    lost  1 times

    total unique fn went from 17 to 18 lost    +5.88%
    mean fn % went from 0.85 to 0.9 lost    +5.88%
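
For anyone checking the won/tied/lost labels and percentages above, here is a minimal sketch of the arithmetic. It is not the code that produced the listing. The function name `compare` is hypothetical, and it assumes that "won" means the second column is lower than the first and that the percentage is the change relative to the first column.

```python
def compare(before, after):
    """Return (label, percent_change) for one before/after pair of error figures."""
    if before == after:
        return "tied", 0.0
    # Assumption: a lower error figure in the second column counts as a win.
    label = "won" if after < before else "lost"
    if before == 0:
        # Relative change is undefined when the first column is zero.
        return label, float("inf")
    # Change relative to the first column, in percent.
    return label, (after - before) * 100.0 / before


# Pairs taken from the fp and fn tables and the "total unique" lines above.
for before, after in [(1.5, 1.0), (0.5, 1.0), (30, 29), (17, 18)]:
    label, pct = compare(before, after)
    print("%6.3f %6.3f  %-5s %+.2f%%" % (before, after, label, pct))
```

Under those assumptions the four pairs come out as -33.33%, +100.00%, -3.33% and +5.88%, which agrees with the listing.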

    ham mean                     ham sdev
      20.82   20.59   -1.10%        6.43    6.18   -3.89%
      21.86   21.66   -0.91%        6.63    6.26   -5.58%
      21.38   21.26   -0.56%        6.49    6.30   -2.93%
      21.96   21.79   -0.77%        6.26    6.24   -0.32%
      21.51   21.29   -1.02%        6.72    6.62   -1.49%
      21.66   21.43   -1.06%        6.98    6.84   -2.01%
      21.45   21.32   -0.61%        7.66    7.65   -0.13%
      21.74   21.51   -1.06%        6.69    6.64   -0.75%
      21.71   21.49   -1.01%        7.44    7.21   -3.09%
      21.87   21.73   -0.64%        5.93    5.92   -0.17%

    ham mean and sdev for all runs
      21.60   21.41   -0.88%        6.75    6.61   -2.07%

    spam mean                    spam sdev
      74.10   73.56   -0.73%       12.99   13.13   +1.08%
      72.47   71.74   -1.01%       13.92   13.78   -1.01%
      74.05   73.52   -0.72%       13.00   13.07   +0.54%
      74.00   73.54   -0.62%       12.27   12.16   -0.90%
      72.43   71.91   -0.72%       13.73   13.52   -1.53%
      72.68   72.24   -0.61%       13.27   13.26   -0.08%
      72.57   71.84   -1.01%       13.03   12.99   -0.31%
      71.50   71.30   -0.28%       12.12   12.14   +0.17%
      73.25   72.68   -0.78%       12.67   12.53   -1.10%
      73.02   72.70   -0.44%       12.44   12.43   -0.08%

    spam mean and sdev for all runs
      73.01   72.50   -0.70%       12.98   12.94   -0.31%

    ham/spam mean difference: 51.41 51.09 -0.32
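
The mean-difference line can be checked directly from the two "all runs" lines. This is a small sketch under the assumption that the difference is simply spam mean minus ham mean; the variable and function names are illustrative only.

```python
# (before, after) values copied from the "all runs" lines above.
ham_mean = (21.60, 21.41)
spam_mean = (73.01, 72.50)


def separation(ham, spam):
    """Gap between the spam and ham mean scores, rounded to two decimals."""
    return round(spam - ham, 2)


sep_before = separation(ham_mean[0], spam_mean[0])
sep_after = separation(ham_mean[1], spam_mean[1])
change = round(sep_after - sep_before, 2)
print(sep_before, sep_after, change)
```

The gap between the two populations shrinks by 0.32 points, so the separation is slightly worse after the change even though both means moved down.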

I notice it's now suggesting an even lower best cutoff (0.375, down from 0.4).

Before:

    -> best cutoff for all runs: 0.4
    ->     with weighted total 1*30 fp + 17 fn = 47
    ->     fp rate 1.5%  fn rate 0.85%

After:

    -> best cutoff for all runs: 0.375
    ->     with weighted total 1*35 fp + 7 fn = 42
    ->     fp rate 1.75%  fn rate 0.35%
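
The weighted totals and rates in the two best-cutoff summaries follow from the raw counts. Below is a minimal sketch of that arithmetic, not the test driver's own code. It assumes the leading `1*` is a weight applied to false positives and that each rate is taken over the 2000 hams or 2000 spams tested (10 sets of 200); the function names are hypothetical.

```python
HAMS_TESTED = 10 * 200   # 10 sets of 200 hams each
SPAMS_TESTED = 10 * 200  # 10 sets of 200 spams each


def weighted_total(fp, fn, fp_weight=1):
    """Cost of a cutoff: false positives scaled by fp_weight, plus false negatives."""
    return fp_weight * fp + fn


def rate_percent(errors, tested):
    """Error count expressed as a percentage of the messages tested."""
    return 100.0 * errors / tested


# (cutoff, fp count, fn count) from the Before and After summaries above.
for cutoff, fp, fn in [(0.4, 30, 17), (0.375, 35, 7)]:
    print("cutoff %.3f: total %d, fp rate %.2f%%, fn rate %.2f%%" % (
        cutoff,
        weighted_total(fp, fn),
        rate_percent(fp, HAMS_TESTED),
        rate_percent(fn, SPAMS_TESTED),
    ))
```

With a weight of 1 the lower cutoff trades 5 extra false positives for 10 fewer false negatives, which is why its total drops from 47 to 42. A larger false-positive weight would change that trade-off.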

Skip