[Spambayes] Another change

Tim Peters tim.one@comcast.net
Sun, 29 Sep 2002 16:34:58 -0400


Change checked in to tokenizer.py:

tokenize_headers():  Based on a silly experiment that *only* tokenized
Subject lines, added a gimmick here to generate tokens for runs of
punctuation characters (\W+) in subject lines.
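
To make the gimmick concrete, here is a minimal sketch of the idea, not the code as checked in. The function name `subject_punctuation_tokens` and the `subject:` prefix are assumptions made for illustration; only the `\W+` pattern comes from the description above. Note that `\W` also matches whitespace, so a run can include the spaces around the punctuation.

```python
import re

# Runs of non-word characters, as described above (\W+).
PUNCTUATION_RUN_RE = re.compile(r"\W+")

def subject_punctuation_tokens(subject, prefix="subject:"):
    """Yield one token per run of non-word characters in a Subject line.

    Hypothetical helper: the name and the token prefix are assumptions.
    """
    for run in PUNCTUATION_RUN_RE.findall(subject):
        yield prefix + run

if __name__ == "__main__":
    for token in subject_punctuation_tokens("FREE!!! Act now $$$"):
        print(repr(token))
```

A subject with no punctuation or whitespace yields no tokens at all, so the gimmick only adds evidence when a subject line actually contains such runs.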

-> <stat> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
   [ditto 19 times]

false positive percentages
    0.050  0.000  won   -100.00%
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied

won   1 times
tied  9 times
lost  0 times

total unique fp went from 3 to 2 won    -33.33%
mean fp % went from 0.015 to 0.01 won    -33.33%
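
The percentages in the right-hand column read as plain relative change from the "before" figure to the "after" figure. A small sketch of that arithmetic follows; the helper name `relative_change_pct` is an assumption, and treating an unchanged pair as 0 is my own choice so that 0.000 vs 0.000 does not divide by zero.

```python
def relative_change_pct(before, after):
    """Relative change from before to after, as a percentage.

    Hypothetical helper; an unchanged pair (including 0 and 0)
    is reported as 0.0, an assumption made for this sketch.
    """
    if before == after:
        return 0.0
    return (after - before) / before * 100.0

if __name__ == "__main__":
    print(round(relative_change_pct(3, 2), 2))          # total unique fp
    print(round(relative_change_pct(0.015, 0.01), 2))   # mean fp %
    print(round(relative_change_pct(0.050, 0.000), 2))  # first fp row
```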

false negative percentages
    0.071  0.071  tied
    0.071  0.071  tied
    0.000  0.000  tied
    0.143  0.143  tied
    0.143  0.143  tied
    0.214  0.214  tied
    0.143  0.143  tied
    0.143  0.143  tied
    0.214  0.214  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 16 to 16 tied
mean fn % went from 0.114285714286 to 0.114285714286 tied

ham mean                     ham sdev
  25.74   25.65   -0.35%        5.74    5.67   -1.22%
  25.69   25.61   -0.31%        5.56    5.50   -1.08%
  25.64   25.57   -0.27%        5.74    5.67   -1.22%
  25.74   25.66   -0.31%        5.61    5.54   -1.25%
  25.50   25.42   -0.31%        5.78    5.72   -1.04%
  25.58   25.51   -0.27%        5.44    5.39   -0.92%
  25.73   25.65   -0.31%        5.63    5.59   -0.71%
  25.69   25.61   -0.31%        5.47    5.41   -1.10%
  25.92   25.84   -0.31%        5.54    5.48   -1.08%
  25.90   25.81   -0.35%        5.88    5.81   -1.19%

ham mean and sdev for all runs
  25.71   25.63   -0.31%        5.64    5.58   -1.06%

spam mean                    spam sdev
  84.07   83.86   -0.25%        7.10    7.09   -0.14%
  83.83   83.64   -0.23%        6.84    6.83   -0.15%
  83.46   83.27   -0.23%        6.80    6.81   +0.15%
  84.03   83.82   -0.25%        6.88    6.88   +0.00%
  84.08   83.89   -0.23%        6.68    6.65   -0.45%
  83.96   83.78   -0.21%        6.99    6.96   -0.43%
  83.62   83.42   -0.24%        6.84    6.82   -0.29%
  84.04   83.86   -0.21%        6.71    6.71   +0.00%
  84.08   83.88   -0.24%        7.01    6.98   -0.43%
  83.97   83.75   -0.26%        6.65    6.65   +0.00%

spam mean and sdev for all runs
  83.91   83.72   -0.23%        6.85    6.84   -0.15%

ham/spam mean difference: 58.20 58.09 -0.11

This is consistent but weak.  Staring at the false negatives shows
that it's moving them "in the right direction", though, and histogram
analysis says something stronger:

-> best cutoff for all runs: 0.55
->     with weighted total 10*2 fp + 11 fn = 31
->     fp rate 0.01%  fn rate 0.0786%

That is, if I had run at spam_cutoff 0.55 instead of 0.56, it would
have been a pure win, leaving the 2 f-p alone but dropping 5(!) of the
16 f-n.
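
The "weighted total" above counts each false positive as 10 false negatives. Here is a rough sketch of picking a cutoff that way from raw scores. The names `weighted_total` and `best_cutoff` are hypothetical, and so is the convention that a score at or above the cutoff is called spam. This is an illustration of the arithmetic, not the histogram code that produced the output.

```python
def weighted_total(fp, fn, fp_weight=10):
    """Cost of a cutoff: each false positive counts fp_weight times."""
    return fp_weight * fp + fn

def best_cutoff(ham_scores, spam_scores, candidates, fp_weight=10):
    """Return (cutoff, cost, fp, fn) for the cheapest candidate cutoff.

    Assumption: a message scoring >= cutoff is classified as spam.
    Ties go to the first candidate tried.
    """
    best = None
    for cutoff in candidates:
        fp = sum(1 for s in ham_scores if s >= cutoff)   # ham called spam
        fn = sum(1 for s in spam_scores if s < cutoff)   # spam called ham
        cost = weighted_total(fp, fn, fp_weight)
        if best is None or cost < best[1]:
            best = (cutoff, cost, fp, fn)
    return best

if __name__ == "__main__":
    # The figures reported above: 2 fp and 11 fn at cutoff 0.55.
    print(weighted_total(2, 11))
    # Rates, assuming 10 runs of 2000 hams and 1400 spams each.
    print(round(2 / 20000 * 100, 4), round(11 / 14000 * 100, 4))
```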