[Spambayes] RE: [Spambayes-checkins] spambayes tokenizer.py,1.40,1.41

Tim Peters tim.one@comcast.net
Sat, 28 Sep 2002 01:16:40 -0400


[Neil Schemenauer]
> ...
> Modified Files:
> 	tokenizer.py
> Log Message:
> Add basic message-id tokenization.  Right now it just checks that it
> exists and conforms to the usual syntax.  If it does, the host part is
> also returned.  I tried doing more but the extra stuff was never
> considered a good discriminator.  Stupid wins again. :-)

Neil, is there a reason to make this an option?  That is, as opposed to just
doing it all the time?  Like, could this screw up a mixed-source corpus
somehow?  An invalid message id is a strong spam indicator in my
mixed-source corpus for what appear to be legit reasons, and was strong
enough to get rid of 3(!) of my remaining 18 false negatives.
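
For anyone who hasn't read the checkin, here's a minimal sketch of the kind of check Neil's log message describes (header exists, looks like the usual <local@host> form, emit the host part if so). It is not the code in tokenizer.py: the regexp, the function name and the token spellings below are all made up for illustration.

```python
import re
from email import message_from_string

# Hypothetical pattern for the "usual syntax": <local-part@host>.
# Deliberately loose; it is an assumption, not the project's regexp.
MSGID_RE = re.compile(r'\s*<[^@<>\s]+@([^<>\s]+)>\s*$')

def message_id_tokens(msg):
    """Yield one token describing the Message-ID header of msg.

    Token spellings are illustrative assumptions.
    """
    msgid = msg.get("Message-ID", "")
    m = MSGID_RE.match(msgid)
    if m:
        # Well-formed: keep only the host part as a clue.
        yield "message-id:@" + m.group(1).lower()
    else:
        # Missing or malformed: a single "invalid" clue.
        yield "message-id:invalid"

good = message_from_string("Message-ID: <abc123@example.com>\nSubject: hi\n\nbody\n")
bad = message_from_string("Message-ID: not-a-message-id\nSubject: hi\n\nbody\n")
missing = message_from_string("Subject: hi\n\nbody\n")

good_tokens = list(message_id_tokens(good))
bad_tokens = list(message_id_tokens(bad))
missing_tokens = list(message_id_tokens(missing))
```

In this sketch a missing header and a malformed one produce the same token, which is the simplest reading of "checks that it exists and conforms to the usual syntax".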

[Classifier] # all current defaults
robinson_probability_x = 0.5
robinson_minimum_prob_strength = 0.1
robinson_probability_s = 0.45
max_discriminators = 150

[Tokenizer]
mine_message_ids: False on the left and True on the right:

-> <stat> tested 2000 hams & 1400 spams against 18000 hams & 12600 spams
   ...

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.050  0.050  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 2 to 2 tied
mean fp % went from 0.01 to 0.01 tied

false negative percentages
    0.071  0.071  tied
    0.071  0.071  tied
    0.000  0.000  tied
    0.143  0.071  won    -50.35%
    0.143  0.143  tied
    0.214  0.214  tied
    0.143  0.143  tied
    0.214  0.143  won    -33.18%
    0.286  0.214  won    -25.17%
    0.000  0.000  tied

won   3 times
tied  7 times
lost  0 times

total unique fn went from 18 to 15 won    -16.67%
mean fn % went from 0.128571428571 to 0.107142857143 won    -16.67%
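
The "won" percentages above are consistent with a plain relative change, (after - before) / before, and the mean fn % figures are consistent with total misses divided by all 14000 spams tested (10 runs of 1400). A small sketch to reproduce the arithmetic; the function names are mine, not anything from the comparison script:

```python
def relative_change(before, after):
    """Percentage change from before to after; negative means a reduction."""
    return (after - before) / before * 100.0

def mean_error_percent(total_errors, runs, msgs_per_run):
    """Error rate as a percentage of all messages tested across the runs."""
    return total_errors / (runs * msgs_per_run) * 100.0

# Total unique false negatives: 18 with the option off, 15 with it on.
fn_change = relative_change(18, 15)

# Mean fn % over 10 runs of 1400 spams each.
fn_rate_off = mean_error_percent(18, 10, 1400)
fn_rate_on = mean_error_percent(15, 10, 1400)

print("fn change: %.2f%%" % fn_change)
print("mean fn %%: %.12f -> %.12f" % (fn_rate_off, fn_rate_on))
```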

ham mean                     ham sdev
  28.24   28.00   -0.85%        5.81    5.80   -0.17%
  28.17   27.93   -0.85%        5.63    5.62   -0.18%
  28.16   27.91   -0.89%        5.75    5.76   +0.17%
  28.27   28.02   -0.88%        5.68    5.67   -0.18%
  28.07   27.82   -0.89%        5.85    5.85   +0.00%
  28.14   27.88   -0.92%        5.54    5.53   -0.18%
  28.30   28.05   -0.88%        5.69    5.69   +0.00%
  28.25   28.00   -0.88%        5.54    5.54   +0.00%
  28.39   28.14   -0.88%        5.61    5.61   +0.00%
  28.41   28.16   -0.88%        5.94    5.93   -0.17%

ham mean and sdev for all runs
  28.24   27.99   -0.89%        5.71    5.70   -0.18%

spam mean                    spam sdev
  84.88   85.00   +0.14%        6.96    6.92   -0.57%
  84.66   84.80   +0.17%        6.73    6.66   -1.04%
  84.35   84.48   +0.15%        6.62    6.57   -0.76%
  84.88   85.01   +0.15%        6.71    6.65   -0.89%
  84.89   85.01   +0.14%        6.54    6.49   -0.76%
  84.77   84.89   +0.14%        6.87    6.82   -0.73%
  84.46   84.61   +0.18%        6.76    6.68   -1.18%
  84.87   85.00   +0.15%        6.61    6.52   -1.36%
  84.88   85.02   +0.16%        6.85    6.78   -1.02%
  84.81   84.96   +0.18%        6.55    6.47   -1.22%

spam mean and sdev for all runs
  84.75   84.88   +0.15%        6.72    6.66   -0.89%

ham/spam mean difference: 56.51 56.89 +0.38

So it was a small but pure win:  it reduced the ham mean a little, increased
the spam mean a little and decreased its variance, and increased the mean
spread a little.  At these rates <wink>, can't get much purer than that (the
effect on ham variance was tiny and random)!
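
The "mean spread" here is just the all-runs spam mean minus the all-runs ham mean. A quick check of the reported 56.51 and 56.89, using the summary figures from the output above (the variable names are mine):

```python
# All-runs means copied from the test output: (option off, option on).
ham_mean = (28.24, 27.99)
spam_mean = (84.75, 84.88)

def spread(ham, spam):
    """Distance between the spam and ham score means."""
    return spam - ham

spread_off = spread(ham_mean[0], spam_mean[0])
spread_on = spread(ham_mean[1], spam_mean[1])

print("spread: %.2f -> %.2f (%+.2f)" % (spread_off, spread_on, spread_on - spread_off))
```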

Unless someone sees a problem I'm missing, I recommend dropping the option
and always doing this.