[Spambayes] RE: [Spambayes-checkins] spambayes tokenizer.py,1.40,1.41
Tim Peters
tim.one@comcast.net
Sat, 28 Sep 2002 01:16:40 -0400
[Neil Schemenauer]
> ...
> Modified Files:
> tokenizer.py
> Log Message:
> Add basic message-id tokenization. Right now it just checks that it
> exists and conforms to the usual syntax. If it does, the host part is
> also returned. I tried doing more but the extra stuff was never
> considered a good discriminator. Stupid wins again. :-)
Neil, is there a reason to make this an option? That is, as opposed to just
doing it all the time? Like, could this screw up a mixed-source corpus
somehow? An invalid message id is a strong spam indicator in my
mixed-source corpus for what appear to be legit reasons, and was strong
enough to get rid of 3(!) of my remaining 18 false negatives.
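For anyone who hasn't looked at the checkin, here is a minimal sketch of the kind of check Neil's log message describes: see whether a Message-ID header exists and has the usual <local@host> shape, and if so emit the host part. This is an illustration only, not the code in tokenizer.py; the regular expression, the function name message_id_tokens, and the token spellings are all assumptions made up for the example.

```python
import re
from email import message_from_string

# Hypothetical pattern for the "usual syntax": <local-part@host>.
# Deliberately loose; it is not a full RFC parser.
MESSAGE_ID_RE = re.compile(r'\s*<[^@>\s]+@([^>\s]+)>\s*$')

def message_id_tokens(msg):
    """Yield one token describing the Message-ID header of msg."""
    msgid = msg.get("Message-ID", "")   # missing header -> empty string
    m = MESSAGE_ID_RE.match(msgid)
    if m is None:
        # Missing or malformed; either way it is treated as one clue.
        yield "message-id:invalid"
    else:
        # Looks OK: keep only the host part.
        yield "message-id:@" + m.group(1).lower()

good = message_from_string("Message-ID: <abc.123@mail.example.com>\n\nbody\n")
bad = message_from_string("Subject: no id here\n\nbody\n")
print(list(message_id_tokens(good)))   # ['message-id:@mail.example.com']
print(list(message_id_tokens(bad)))    # ['message-id:invalid']
```

The point of emitting a single "invalid" token is that the classifier gets to decide how much it matters; the tokenizer doesn't have to.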
[Classifier] # all current defaults
robinson_probability_x = 0.5
robinson_minimum_prob_strength = 0.1
robinson_probability_s = 0.45
max_discriminators = 150
[Tokenizer]
mine_message_ids: False on the left and True on the right:
-> <stat> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
...
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.050 0.050 tied
won 0 times
tied 10 times
lost 0 times
total unique fp went from 2 to 2 tied
mean fp % went from 0.01 to 0.01 tied
false negative percentages
0.071 0.071 tied
0.071 0.071 tied
0.000 0.000 tied
0.143 0.071 won -50.35%
0.143 0.143 tied
0.214 0.214 tied
0.143 0.143 tied
0.214 0.143 won -33.18%
0.286 0.214 won -25.17%
0.000 0.000 tied
won 3 times
tied 7 times
lost 0 times
total unique fn went from 18 to 15 won -16.67%
mean fn % went from 0.128571428571 to 0.107142857143 won -16.67%
ham mean                     ham sdev
  28.24   28.00   -0.85%        5.81    5.80   -0.17%
  28.17   27.93   -0.85%        5.63    5.62   -0.18%
  28.16   27.91   -0.89%        5.75    5.76   +0.17%
  28.27   28.02   -0.88%        5.68    5.67   -0.18%
  28.07   27.82   -0.89%        5.85    5.85   +0.00%
  28.14   27.88   -0.92%        5.54    5.53   -0.18%
  28.30   28.05   -0.88%        5.69    5.69   +0.00%
  28.25   28.00   -0.88%        5.54    5.54   +0.00%
  28.39   28.14   -0.88%        5.61    5.61   +0.00%
  28.41   28.16   -0.88%        5.94    5.93   -0.17%
ham mean and sdev for all runs
  28.24   27.99   -0.89%        5.71    5.70   -0.18%
spam mean                    spam sdev
  84.88   85.00   +0.14%        6.96    6.92   -0.57%
  84.66   84.80   +0.17%        6.73    6.66   -1.04%
  84.35   84.48   +0.15%        6.62    6.57   -0.76%
  84.88   85.01   +0.15%        6.71    6.65   -0.89%
  84.89   85.01   +0.14%        6.54    6.49   -0.76%
  84.77   84.89   +0.14%        6.87    6.82   -0.73%
  84.46   84.61   +0.18%        6.76    6.68   -1.18%
  84.87   85.00   +0.15%        6.61    6.52   -1.36%
  84.88   85.02   +0.16%        6.85    6.78   -1.02%
  84.81   84.96   +0.18%        6.55    6.47   -1.22%
spam mean and sdev for all runs
  84.75   84.88   +0.15%        6.72    6.66   -0.89%
ham/spam mean difference: 56.51 56.89 +0.38
So it was a small but pure win: it reduced the ham mean a little, increased
the spam mean a little and decreased its variance, and increased the mean
spread a little. At these rates <wink>, can't get much purer than that (the
effect on ham variance was tiny and random)!
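In case the percentages in the listing look mysterious, here is a small sketch of the arithmetic that reproduces them. Each run scores 1400 spams, so one false negative is 1/1400, or about 0.071%. The helper names fn_percent and pct_change are made up for this illustration; they are not taken from the test driver, and the per-run "won" figures only appear to be computed from the rounded rates as displayed.

```python
def fn_percent(misses, spams_tested):
    """False negative rate as a percentage of the spams tested."""
    return 100.0 * misses / spams_tested

def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before

# One and two misses out of 1400 spams in a single run.
print("%.3f" % fn_percent(1, 1400))        # 0.071
print("%.3f" % fn_percent(2, 1400))        # 0.143

# Total unique false negatives went from 18 to 15.
print("%.2f%%" % pct_change(18, 15))       # -16.67%

# Per-run change, using the rounded rates shown in the listing.
print("%.2f%%" % pct_change(0.143, 0.071)) # -50.35%

# Mean fn % over all 10 runs (14000 spam scorings in total).
print(fn_percent(18, 14000), fn_percent(15, 14000))
```

With only 18 misses to start with, each one removed moves the totals by more than 5%, which is why 3 fewer shows up as a 16.67% improvement.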
Unless someone sees a problem I'm missing, I recommend dropping the option
and always doing this.