[Spambayes] options.skip_max_word_size.

Anthony Baxter anthony@interlink.com.au
Mon Oct 28 07:08:01 2002


I noticed that a bunch of really nice ham clues were being skipped in the
'unsure' bucket of my personal email. They were words like 'interconnection'
and other longer techie words. I added an option, skip_max_word_size, and
tried boosting it to 20 (from the default of 12).
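
For anyone following along, here is a rough sketch of what the option
changes. It is a simplified illustration, not the actual tokenizer code:
the function name tokenize_word, the lower length bound of 3, and the exact
shape of the "skip:" summary token are assumptions made for the example.

```python
def tokenize_word(word, skip_max_word_size=12):
    """Yield the tokens produced for a single word (simplified sketch).

    Assumptions: words shorter than 3 characters are dropped, and an
    over-long word is replaced by a coarse "skip:" summary token.
    """
    n = len(word)
    if 3 <= n <= skip_max_word_size:
        # Short enough: the word itself is kept as a clue.
        yield word
    elif n > skip_max_word_size:
        # Too long: keep only the first character and the length
        # rounded down to a multiple of 10.
        yield "skip:%s %d" % (word[0], n // 10 * 10)


# 'interconnection' has 15 characters, so the limit decides its fate.
print(list(tokenize_word("interconnection", 12)))
print(list(tokenize_word("interconnection", 20)))
```

With the limit at 12 the word collapses into a generic summary token shared
with other long words starting with 'i'; with the limit at 20 it survives as
its own clue.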

cmp.py shows this (skip_max_word_size 12 on the left, 20 on the right):

false positive percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          

won   0 times
tied  4 times
lost  0 times

total unique fp went from 0 to 0 tied          
mean fp % went from 0.0 to 0.0 tied          

false negative percentages
    0.000  0.277  lost  +(was 0)
    0.559  0.838  lost   +49.91%
    0.836  1.114  lost   +33.25%
    0.279  0.836  lost  +199.64%

won   0 times
tied  0 times
lost  4 times

total unique fn went from 6 to 11 lost   +83.33%
mean fn % went from 0.418410855729 to 0.766214465324 lost   +83.12%

ham mean                     ham sdev
   0.67    0.58  -13.43%        4.54    3.89  -14.32%
   0.45    0.38  -15.56%        2.64    2.24  -15.15%
   0.68    0.67   -1.47%        4.44    4.57   +2.93%
   0.48    0.45   -6.25%        3.52    3.47   -1.42%

ham mean and sdev for all runs
   0.57    0.52   -8.77%        3.86    3.64   -5.70%

spam mean                    spam sdev
  98.30   98.76   +0.47%        8.61    7.95   -7.67%
  97.47   97.61   +0.14%       10.67   10.78   +1.03%
  98.51   98.43   -0.08%        9.13   10.93  +19.72%
  97.58   97.27   -0.32%       10.90   12.08  +10.83%

spam mean and sdev for all runs
  97.97   98.02   +0.05%        9.88   10.56   +6.88%

ham/spam mean difference: 97.40 97.50 +0.10

Unfortunately, cmp.py doesn't report the important bit: my 'unsure' count
dropped from 164 to 135!
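
To put the two changes on the same footing, the relative change can be
worked out the same way for both. This is only a small arithmetic sketch
using the counts quoted in this message; pct_change is a hypothetical
helper, not part of cmp.py.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


fn_change = pct_change(6, 11)          # total unique false negatives
unsure_change = pct_change(164, 135)   # messages in the 'unsure' bucket

print("fn:     %+.2f%%" % fn_change)
print("unsure: %+.2f%%" % unsure_change)
```

In absolute terms that is 5 more false negatives against 29 fewer unsures.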

I'm not sure whether this is just an artifact of my own data or something
more general. It would be good if others could try it as well.

Anthony