[spambayes-dev] spammy subject lines

T. Alexander Popiel popiel at wolfskeep.com
Wed Oct 15 13:48:36 EDT 2003


In message:  <1ED4ECF91CDED24C8D012BCF2B034F13026F29A6 at its-xchg4.massey.ac.nz>
             "Tony Meyer" <tameyer at ihug.co.nz> writes:
>
>Of course, my data doesn't really tell us anything until we can compare
>it to someone else's...hopefully the OP, at least, will give this a go.

Well, here are my results (sorry about being slow; to get any mail with
the new obfuscated subject lines, I needed to regrab (and reclean) my
corpora into the testing framework):


output/newnormal -> output/subjstrip
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2151 hams & 5576 spams against 19367 hams & 50183 spams
-> <stat> tested 2151 hams & 5575 spams against 19367 hams & 50184 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> <stat> tested 2151 hams & 5576 spams against 19367 hams & 50183 spams
-> <stat> tested 2151 hams & 5575 spams against 19367 hams & 50184 spams

false positive percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.186  0.186  tied          
    0.093  0.093  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.093  0.093  tied          
    0.000  0.000  tied          
    0.046  0.046  tied          
    0.046  0.046  tied          

won   0 times
tied 10 times
lost  0 times

total unique fp went from 10 to 10 tied          
mean fp % went from 0.0464727221194 to 0.0464727221194 tied          

false negative percentages
    0.287  0.287  tied          
    0.412  0.412  tied          
    0.287  0.269  won     -6.27%
    0.305  0.305  tied          
    0.251  0.233  won     -7.17%
    0.377  0.359  won     -4.77%
    0.287  0.287  tied          
    0.377  0.395  lost    +4.77%
    0.215  0.233  lost    +8.37%
    0.269  0.269  tied          

won   3 times
tied  5 times
lost  2 times

total unique fn went from 171 to 170 won     -0.58%
mean fn % went from 0.306676274359 to 0.304882874073 won     -0.58%
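For anyone checking the percentages, the won/lost/tied labels and the signed figures next to them are consistent with a plain relative change from the first column to the second. A minimal sketch, assuming that definition (the function names here are illustrative, not taken from the test framework):

```python
def relative_change(before, after):
    """Percentage change from `before` to `after`; negative means fewer errors."""
    if before == 0:
        # Assumption: 0 -> 0 counts as no change; 0 -> anything else is unbounded.
        return 0.0 if after == 0 else float("inf")
    return (after - before) / before * 100.0


def verdict(before, after):
    """Label a pair of error rates the way the comparison output does."""
    if after < before:
        return "won"
    if after > before:
        return "lost"
    return "tied"


# Total unique false negatives, newnormal -> subjstrip.
print(verdict(171, 170), f"{relative_change(171, 170):+.2f}%")
# One of the per-run false negative rows.
print(verdict(0.215, 0.233), f"{relative_change(0.215, 0.233):+.2f}%")
```

With only one false negative separating the two totals, the -0.58% figure is a single message out of 171.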

ham mean                     ham sdev
   0.20    0.21   +5.00%        3.27    3.43   +4.89%
   0.20    0.20   +0.00%        2.78    2.78   +0.00%
   0.49    0.49   +0.00%        5.61    5.61   +0.00%
   0.24    0.24   +0.00%        4.02    4.03   +0.25%
   0.13    0.14   +7.69%        2.21    2.21   +0.00%
   0.14    0.14   +0.00%        2.68    2.65   -1.12%
   0.23    0.24   +4.35%        3.97    4.04   +1.76%
   0.14    0.14   +0.00%        2.71    2.70   -0.37%
   0.10    0.10   +0.00%        2.63    2.64   +0.38%
   0.30    0.31   +3.33%        4.18    4.24   +1.44%

ham mean and sdev for all runs
   0.22    0.22   +0.00%        3.55    3.57   +0.56%

spam mean                    spam sdev
  99.02   99.03   +0.01%        7.26    7.21   -0.69%
  98.87   98.90   +0.03%        8.18    8.08   -1.22%
  99.14   99.16   +0.02%        6.83    6.73   -1.46%
  98.90   98.91   +0.01%        7.73    7.70   -0.39%
  98.99   99.02   +0.03%        7.00    6.89   -1.57%
  98.88   98.89   +0.01%        7.90    7.85   -0.63%
  98.96   98.98   +0.02%        7.52    7.46   -0.80%
  98.87   98.89   +0.02%        8.05    8.01   -0.50%
  99.08   99.09   +0.01%        6.97    6.91   -0.86%
  99.15   99.17   +0.02%        6.83    6.75   -1.17%

spam mean and sdev for all runs
  98.99   99.00   +0.01%        7.44    7.38   -0.81%

ham/spam mean difference: 98.77 98.78 +0.01
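The mean difference is just spam mean minus ham mean over all runs. The k value in the summary table further down appears to be that difference divided by the sum of the two standard deviations; this is an assumption, but it reproduces both reported values from the rounded figures. A small sketch (the function name is illustrative):

```python
def separation(ham_mean, ham_sdev, spam_mean, spam_sdev):
    """Return (mean difference, k) for one run's overall score statistics.

    Assumption: k = (spam mean - ham mean) / (ham sdev + spam sdev).
    """
    diff = spam_mean - ham_mean
    return diff, diff / (ham_sdev + spam_sdev)


# Overall figures for each configuration, as reported in the output.
for name, stats in [
    ("newnormal", (0.22, 3.55, 98.99, 7.44)),
    ("subjstrip", (0.22, 3.57, 99.00, 7.38)),
]:
    diff, k = separation(*stats)
    print(f"{name}: mean diff {diff:.2f}, k {k:.2f}")
```

A larger k means the ham and spam score distributions overlap less, relative to their spread.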

--------

filename:    newnormal   subjstrip
ham:spam:  21518:55759 21518:55759
fp total:       10      10
fp %:         0.05    0.05
fn total:      171     170
fn %:         0.31    0.30
unsure t:     1098    1075
unsure %:     1.42    1.39
real cost: $490.60 $485.00
best cost: $395.40 $401.80
h mean:       0.22    0.22
h sdev:       3.55    3.57
s mean:      98.99   99.00
s sdev:       7.44    7.38
mean diff:   98.77   98.78
k:            8.99    9.02
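The "real cost" row can be reproduced from the totals above it, assuming weights of $10 per false positive, $1 per false negative, and $0.20 per unsure. Those weights are an assumption here, but they match both reported figures exactly:

```python
# Assumed per-message costs in dollars; chosen because they reproduce the table.
FP_WEIGHT = 10.00
FN_WEIGHT = 1.00
UNSURE_WEIGHT = 0.20


def real_cost(fp, fn, unsure):
    """Total cost of a run given its fp, fn and unsure counts."""
    return fp * FP_WEIGHT + fn * FN_WEIGHT + unsure * UNSURE_WEIGHT


print(f"newnormal: ${real_cost(10, 171, 1098):.2f}")
print(f"subjstrip: ${real_cost(10, 170, 1075):.2f}")
```

Under those weights, the $5.60 drop in real cost comes from one fewer false negative ($1.00) plus 23 fewer unsures ($4.60).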


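Overall, it looks like a very minor win: one fewer false negative and
23 fewer unsures, with false positives unchanged. It makes the ham
slightly less distinct (sdev up from 3.55 to 3.57), but the spam
slightly more distinct (sdev down from 7.44 to 7.38).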

- Alex


