[spambayes-dev] untested idea for calculating message lengths

Tony Meyer tameyer at ihug.co.nz
Mon Aug 2 06:26:33 CEST 2004


Sorry about the delay - last week was very busy for me - but I've managed to
give this a go, too.

Mixed results, but I'd call it a loss for me:

-> <stat> tested 4692 hams & 386 spams against 18762 hams & 1537 spams
-> <stat> tested 4695 hams & 381 spams against 18759 hams & 1542 spams
-> <stat> tested 4693 hams & 383 spams against 18761 hams & 1540 spams
-> <stat> tested 4690 hams & 384 spams against 18764 hams & 1539 spams
-> <stat> tested 4684 hams & 389 spams against 18770 hams & 1534 spams
-> <stat> tested 4691 hams & 385 spams against 18763 hams & 1538 spams
-> <stat> tested 4691 hams & 385 spams against 18763 hams & 1538 spams
-> <stat> tested 4691 hams & 385 spams against 18763 hams & 1538 spams
-> <stat> tested 4691 hams & 384 spams against 18763 hams & 1539 spams
-> <stat> tested 4690 hams & 384 spams against 18764 hams & 1539 spams

false positive percentages
    0.000  0.000  tied
    0.021  0.000  won   -100.00%
    0.000  0.000  tied
    0.000  0.064  lost  +(was 0)
    0.000  0.043  lost  +(was 0)

won   1 times
tied  2 times
lost  2 times

total unique fp went from 1 to 5 lost  +400.00%
mean fp % went from 0.00425985090522 to 0.0213192344457 lost  +400.47%

false negative percentages
    1.036  0.779  won    -24.81%
    1.050  1.299  lost   +23.71%
    0.783  1.299  lost   +65.90%
    1.823  1.042  won    -42.84%
    1.285  0.781  won    -39.22%

won   3 times
tied  0 times
lost  2 times

total unique fn went from 23 to 20 won    -13.04%
mean fn % went from 1.19553834481 to 1.03990800866 won    -13.02%

ham mean                     ham sdev
   0.09    0.05  -44.44%        1.85    1.44  -22.16%
   0.12    0.11   -8.33%        2.34    1.92  -17.95%
   0.12    0.09  -25.00%        2.06    1.67  -18.93%
   0.09    0.18 +100.00%        2.01    3.27  +62.69%
   0.04    0.16 +300.00%        0.88    3.00 +240.91%

ham mean and sdev for all runs
   0.09    0.12  +33.33%        1.89    2.38  +25.93%

spam mean                    spam sdev
  95.66   96.82   +1.21%       15.14   13.47  -11.03%
  95.73   96.88   +1.20%       15.31   13.94   -8.95%
  97.07   96.30   -0.79%       11.43   14.42  +26.16%
  95.32   95.68   +0.38%       16.78   15.08  -10.13%
  95.55   96.42   +0.91%       15.67   14.02  -10.53%

spam mean and sdev for all runs
  95.86   96.42   +0.58%       14.99   14.20   -5.27%

ham/spam mean difference: 95.77 96.30 +0.53
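
As a reading aid for the output above: the percentage column appears to be a plain relative change from the first value to the second, with "+(was 0)" shown when the first value is zero. The sketch below reproduces a few of the figures under that assumption. The function name `relative_change` is made up for illustration and is not taken from the SpamBayes scripts.

```python
def relative_change(before, after):
    """Percent change from before to after, or None if before is zero.

    Assumption: this mirrors how the percentage column above is computed;
    a zero baseline has no defined relative change, which would explain
    the "+(was 0)" entries.
    """
    if before == 0:
        return None
    return (after - before) / before * 100.0


# Values copied from the false negative and mean fp lines above.
for before, after in [(1.036, 0.779), (1.050, 1.299),
                      (0.00425985090522, 0.0213192344457), (0.000, 0.064)]:
    change = relative_change(before, after)
    label = "(was 0)" if change is None else f"{change:+.2f}%"
    print(f"{before} -> {after}: {label}")
```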

I'm not set up to use the tte.py script the way Skip did to get lengths,
but I set timcv.py to save the classifiers, and got these numbers:

token length    ham   spam
3                 0      1
4                 6     16
5               445    201
6              4910    277
7              7909    441
8              3883    271
9              1205    227
10              246     87
11              131     12
12               24      1

There's about 12 times as much ham as spam, of course, so I don't know that
these mean much.
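
One way to compare the two columns despite the imbalance is to turn each into a fraction of its own class total, so the ham and spam distributions are on the same scale. This is only a sketch of that normalisation using the counts from the table above; the names `counts` and `class_fractions` are invented for the example.

```python
# Counts copied from the table above: token length -> (ham, spam).
counts = {
    3: (0, 1),
    4: (6, 16),
    5: (445, 201),
    6: (4910, 277),
    7: (7909, 441),
    8: (3883, 271),
    9: (1205, 227),
    10: (246, 87),
    11: (131, 12),
    12: (24, 1),
}


def class_fractions(counts):
    """Return token length -> (ham fraction, spam fraction).

    Each count is divided by the total for its own class, which removes
    the effect of there being far more ham than spam.
    """
    ham_total = sum(h for h, _ in counts.values())
    spam_total = sum(s for _, s in counts.values())
    return {
        length: (h / ham_total, s / spam_total)
        for length, (h, s) in counts.items()
    }


print("token length      ham     spam")
for length, (ham_frac, spam_frac) in sorted(class_fractions(counts).items()):
    print(f"{length:<12} {ham_frac:8.2%} {spam_frac:8.2%}")
```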

With a smaller, but more balanced corpus, it's a wash (well, it gets one
fewer unsure):

-> <stat> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied  5 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    6.870  6.870  tied
    3.125  3.125  tied
    7.813  7.813  tied
    3.906  3.906  tied
    5.469  5.469  tied

won   0 times
tied  5 times
lost  0 times

total unique fn went from 35 to 35 tied
mean fn % went from 5.43654580153 to 5.43654580153 tied

ham mean                     ham sdev
   0.18    0.14  -22.22%        1.77    1.40  -20.90%
   0.01    0.01   +0.00%        0.17    0.11  -35.29%
   0.01    0.01   +0.00%        0.12    0.08  -33.33%
   0.03    0.02  -33.33%        0.39    0.29  -25.64%
   0.28    0.27   -3.57%        3.37    3.27   -2.97%

ham mean and sdev for all runs
   0.10    0.09  -10.00%        1.72    1.60   -6.98%

spam mean                    spam sdev
  88.65   88.68   +0.03%       25.58   25.49   -0.35%
  89.82   89.65   -0.19%       23.25   23.49   +1.03%
  87.20   87.22   +0.02%       28.97   28.92   -0.17%
  90.75   90.79   +0.04%       23.91   23.85   -0.25%
  90.28   90.34   +0.07%       25.98   25.96   -0.08%

spam mean and sdev for all runs
  89.34   89.33   -0.01%       25.65   25.64   -0.04%

ham/spam mean difference: 89.24 89.24 -0.00

The token lengths for this corpus are:

token length    ham   spam
3                 1      0
4                11      0
5                53      2
6               221    266
7               297     88
8               213     94
9               115     50
10              168      3
11               16      8
12                5      0
13                8      0

The ratio here is about 2 ham to 1 spam, so, accounting for that, it looks
like ham varies in size a lot more than spam (for this corpus, anyway).
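
A rough way to put a number on "varies more" is to compute the count-weighted mean and standard deviation of the length column for each class. Because both statistics are normalised by the class total, the 2:1 ratio drops out. The sketch below does this for the table above; `table` and `weighted_mean_sd` are illustrative names, and the population (not sample) standard deviation is an arbitrary choice.

```python
import math

# Counts copied from the table above: token length -> (ham, spam).
table = {
    3: (1, 0),
    4: (11, 0),
    5: (53, 2),
    6: (221, 266),
    7: (297, 88),
    8: (213, 94),
    9: (115, 50),
    10: (168, 3),
    11: (16, 8),
    12: (5, 0),
    13: (8, 0),
}


def weighted_mean_sd(pairs):
    """Mean and population standard deviation of values weighted by counts.

    pairs is a list of (value, count) tuples.
    """
    total = sum(n for _, n in pairs)
    mean = sum(x * n for x, n in pairs) / total
    variance = sum(n * (x - mean) ** 2 for x, n in pairs) / total
    return mean, math.sqrt(variance)


ham_mean, ham_sd = weighted_mean_sd([(l, h) for l, (h, _) in table.items()])
spam_mean, spam_sd = weighted_mean_sd([(l, s) for l, (_, s) in table.items()])

print(f"ham:  mean {ham_mean:.2f}, sdev {ham_sd:.2f}")
print(f"spam: mean {spam_mean:.2f}, sdev {spam_sd:.2f}")
```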

=Tony Meyer


