[Spambayes] Another change
Tim Peters
tim.one@comcast.net
Sun, 29 Sep 2002 16:34:58 -0400
Change checked in to tokenizer.py:
tokenize_headers(): Based on a silly experiment that *only* tokenized
Subject lines, added a gimmick here to generate tokens for runs of
non-word characters (\W+, i.e. punctuation and whitespace) in subject
lines.
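To make the gimmick concrete, here is a minimal sketch of generating one token per \W+ run in a Subject line. It is not the code that was checked in: the function name, the "subject:" prefix, and the sample Subject line are assumptions made for illustration. Note that \W also matches whitespace, so runs of spaces produce tokens too.

```python
import re

# Runs of non-word characters. \W matches anything that is not a letter,
# digit or underscore, so whitespace is included along with punctuation.
punctuation_run_re = re.compile(r"\W+")


def subject_punctuation_tokens(subject, prefix="subject:"):
    """Yield one token per run of non-word characters in a Subject line.

    The name and the prefix are hypothetical; they only show the idea.
    """
    for run in punctuation_run_re.findall(subject):
        yield prefix + run


# Hypothetical Subject line, not taken from the test corpus.
tokens = list(subject_punctuation_tokens("FREE!!! Act now..."))
```

Here tokens ends up holding three entries: one for "!!! " (the exclamation marks plus the trailing space), one for the single space, and one for "...".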
-> <stat> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
[ditto 19 times]
false positive percentages
0.050 0.000 won -100.00%
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
won 1 times
tied 9 times
lost 0 times
total unique fp went from 3 to 2 won -33.33%
mean fp % went from 0.015 to 0.01 won -33.33%
false negative percentages
0.071 0.071 tied
0.071 0.071 tied
0.000 0.000 tied
0.143 0.143 tied
0.143 0.143 tied
0.214 0.214 tied
0.143 0.143 tied
0.143 0.143 tied
0.214 0.214 tied
0.000 0.000 tied
won 0 times
tied 10 times
lost 0 times
total unique fn went from 16 to 16 tied
mean fn % went from 0.114285714286 to 0.114285714286 tied
ham mean                     ham sdev
  25.74   25.65   -0.35%       5.74    5.67   -1.22%
  25.69   25.61   -0.31%       5.56    5.50   -1.08%
  25.64   25.57   -0.27%       5.74    5.67   -1.22%
  25.74   25.66   -0.31%       5.61    5.54   -1.25%
  25.50   25.42   -0.31%       5.78    5.72   -1.04%
  25.58   25.51   -0.27%       5.44    5.39   -0.92%
  25.73   25.65   -0.31%       5.63    5.59   -0.71%
  25.69   25.61   -0.31%       5.47    5.41   -1.10%
  25.92   25.84   -0.31%       5.54    5.48   -1.08%
  25.90   25.81   -0.35%       5.88    5.81   -1.19%
ham mean and sdev for all runs
  25.71   25.63   -0.31%       5.64    5.58   -1.06%
spam mean                    spam sdev
  84.07   83.86   -0.25%       7.10    7.09   -0.14%
  83.83   83.64   -0.23%       6.84    6.83   -0.15%
  83.46   83.27   -0.23%       6.80    6.81   +0.15%
  84.03   83.82   -0.25%       6.88    6.88   +0.00%
  84.08   83.89   -0.23%       6.68    6.65   -0.45%
  83.96   83.78   -0.21%       6.99    6.96   -0.43%
  83.62   83.42   -0.24%       6.84    6.82   -0.29%
  84.04   83.86   -0.21%       6.71    6.71   +0.00%
  84.08   83.88   -0.24%       7.01    6.98   -0.43%
  83.97   83.75   -0.26%       6.65    6.65   +0.00%
spam mean and sdev for all runs
  83.91   83.72   -0.23%       6.85    6.84   -0.15%
ham/spam mean difference: 58.20 58.09 -0.11
This is consistent but weak. Staring at the false negatives shows
that it's moving them "in the right direction", though, and histogram
analysis says something stronger:
-> best cutoff for all runs: 0.55
-> with weighted total 10*2 fp + 11 fn = 31
-> fp rate 0.01% fn rate 0.0786%
That is, if I had run at spam_cutoff 0.55 instead of 0.56, it would
have been a pure win, leaving f-p alone at 2 but dropping 5(!) of the
16 f-n.
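The weighted total and the rates quoted above can be checked with a few lines of arithmetic, and the same cost function can be used to pick a cutoff from raw scores. This is only a sketch, not the histogram analysis code itself: the function names are hypothetical, and treating a score >= cutoff as spam is an assumed boundary convention. The weight of 10 per false positive comes from the "10*2 fp + 11 fn" line, and the totals of 20000 hams and 14000 spams from the 10 runs of 2000 hams and 1400 spams.

```python
def weighted_cost(num_fp, num_fn, fp_weight=10):
    """Total cost when one false positive counts as fp_weight false negatives."""
    return fp_weight * num_fp + num_fn


def best_cutoff(ham_scores, spam_scores, candidates, fp_weight=10):
    """Return (cutoff, num_fp, num_fn, cost) with the lowest weighted cost.

    Assumption: a message is called spam when its score is >= cutoff.
    On a tie, the first candidate listed wins.
    """
    best = None
    for cutoff in candidates:
        num_fp = sum(1 for s in ham_scores if s >= cutoff)
        num_fn = sum(1 for s in spam_scores if s < cutoff)
        cost = weighted_cost(num_fp, num_fn, fp_weight)
        if best is None or cost < best[3]:
            best = (cutoff, num_fp, num_fn, cost)
    return best


# Figures reported at cutoff 0.55: 2 fp out of 20000 hams, 11 fn out of 14000 spams.
total = weighted_cost(2, 11)               # 10*2 + 11 = 31
fp_rate = round(100.0 * 2 / 20000, 4)      # 0.01 (percent)
fn_rate = round(100.0 * 11 / 14000, 4)     # 0.0786 (percent)
```

Given lists of ham and spam scores, best_cutoff(ham_scores, spam_scores, [0.55, 0.56]) would compare the two cutoffs discussed here under the same 10-to-1 weighting.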