[spambayes-dev] New results for DNS lookup in tokenizer

Matthew Dixon Cowles matt at mondoinfo.com
Thu Apr 15 21:37:26 EDT 2004


It turns out that I was right when I speculated that using DNS
lookups would work better on more recent spam. I rebuilt my spam sets
from the thousand most recent spams in my spam archive and got rather
better results:


new-pick-aparts.txt -> new-dnss.txt
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams
-> <stat> tested 200 hams & 200 spams against 800 hams & 800 spams

false positive percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.500  0.500  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fp went from 1 to 1 tied          
mean fp % went from 0.1 to 0.1 tied          

false negative percentages
    0.500  0.000  won   -100.00%
    4.500  3.500  won    -22.22%
    1.000  0.500  won    -50.00%
    0.000  0.000  tied          
    3.000  2.500  won    -16.67%

won   4 times
tied  1 times
lost  0 times

total unique fn went from 18 to 13 won    -27.78%
mean fn % went from 1.8 to 1.3 won    -27.78%

ham mean                     ham sdev
   0.34    0.33   -2.94%        3.28    3.22   -1.83%
   0.14    0.14   +0.00%        1.39    1.36   -2.16%
   0.50    0.50   +0.00%        6.84    6.84   +0.00%
   0.48    0.31  -35.42%        3.75    2.10  -44.00%
   0.35    0.38   +8.57%        3.78    4.15   +9.79%

ham mean and sdev for all runs
   0.36    0.33   -8.33%        4.19    4.02   -4.06%

spam mean                    spam sdev
  98.00   98.49   +0.50%       10.43    8.53  -18.22%
  94.60   95.38   +0.82%       19.89   18.38   -7.59%
  97.52   97.96   +0.45%       11.40   10.63   -6.75%
  98.77   98.87   +0.10%        6.47    6.81   +5.26%
  94.78   95.38   +0.63%       18.47   17.51   -5.20%

spam mean and sdev for all runs
  96.73   97.22   +0.51%       14.37   13.33   -7.24%

ham/spam mean difference: 96.37 96.89 +0.52


In addition, the number of unsures decreased somewhat, from 46 to 40:


filename:  new-pick-apart         
                           new-dns
ham:spam:    1000:1000   1000:1000
fp total:            1           1
fp %:             0.10        0.10
fn total:           18          13
fn %:             1.80        1.30
unsure t:           46          40
unsure %:         2.30        2.00
real cost:      $37.20      $31.00
best cost:      $21.60      $19.80
h mean:           0.36        0.33
h sdev:           4.19        4.02
s mean:          96.73       97.22
s sdev:          14.37       13.33
mean diff:       96.37       96.89
k:                5.19        5.58


That's not an enormous win, but it suggests that the improvement I
think I'm seeing in my inbox is probably real. The false negatives
that were eliminated are nonsense spams or spams padded with lots of
bland, unrelated text.
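
For anyone who wants to check the arithmetic in the tables above, here
is a minimal sketch that reproduces the summary figures. The function
names are made up for illustration, and the cost weights ($10 per false
positive, $1 per false negative, $0.20 per unsure) are an assumption:
they are not stated in this message, but they do reproduce the "real
cost" lines exactly.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after` (negative means a decrease)."""
    return (after - before) / before * 100.0


def real_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Weighted cost of a run. The default weights are assumed, not taken
    from this message; they were chosen because they match its figures."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


if __name__ == "__main__":
    # Counts copied from the summary table: fp, fn, unsure.
    before = (1, 18, 46)   # new-pick-apart
    after = (1, 13, 40)    # new-dns

    print("fn change:     %+.2f%%" % relative_change(before[1], after[1]))
    print("unsure change: %+.2f%%" % relative_change(before[2], after[2]))
    print("real cost before: $%.2f" % real_cost(*before))
    print("real cost after:  $%.2f" % real_cost(*after))
```

With those assumed weights, the single false positive accounts for $10
of the cost in both runs, so the drop from $37.20 to $31.00 comes
entirely from the five fewer false negatives and six fewer unsures.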

One could reasonably argue that a technique that works well only on
recent spam shouldn't be included in SpamBayes until it has proven its
value over time.

Regards,
Matt



