[spambayes-dev] untested idea for calculating message lengths
Tony Meyer
tameyer at ihug.co.nz
Mon Aug 2 06:26:33 CEST 2004
Sorry about the delay - last week was very busy for me - but I've managed to
give this a go, too.
Mixed results (false negatives improve, but false positives get worse), so
I'd call it a loss for me:
-> <stat> tested 4692 hams & 386 spams against 18762 hams & 1537 spams
-> <stat> tested 4695 hams & 381 spams against 18759 hams & 1542 spams
-> <stat> tested 4693 hams & 383 spams against 18761 hams & 1540 spams
-> <stat> tested 4690 hams & 384 spams against 18764 hams & 1539 spams
-> <stat> tested 4684 hams & 389 spams against 18770 hams & 1534 spams
-> <stat> tested 4691 hams & 385 spams against 18763 hams & 1538 spams
-> <stat> tested 4691 hams & 385 spams against 18763 hams & 1538 spams
-> <stat> tested 4691 hams & 385 spams against 18763 hams & 1538 spams
-> <stat> tested 4691 hams & 384 spams against 18763 hams & 1539 spams
-> <stat> tested 4690 hams & 384 spams against 18764 hams & 1539 spams
false positive percentages
0.000 0.000 tied
0.021 0.000 won -100.00%
0.000 0.000 tied
0.000 0.064 lost +(was 0)
0.000 0.043 lost +(was 0)
won 1 times
tied 2 times
lost 2 times
total unique fp went from 1 to 5 lost +400.00%
mean fp % went from 0.00425985090522 to 0.0213192344457 lost +400.47%
false negative percentages
1.036 0.779 won -24.81%
1.050 1.299 lost +23.71%
0.783 1.299 lost +65.90%
1.823 1.042 won -42.84%
1.285 0.781 won -39.22%
won 3 times
tied 0 times
lost 2 times
total unique fn went from 23 to 20 won -13.04%
mean fn % went from 1.19553834481 to 1.03990800866 won -13.02%
ham mean                     ham sdev
   0.09    0.05  -44.44%        1.85    1.44  -22.16%
   0.12    0.11   -8.33%        2.34    1.92  -17.95%
   0.12    0.09  -25.00%        2.06    1.67  -18.93%
   0.09    0.18 +100.00%        2.01    3.27  +62.69%
   0.04    0.16 +300.00%        0.88    3.00 +240.91%
ham mean and sdev for all runs
   0.09    0.12  +33.33%        1.89    2.38  +25.93%
spam mean                    spam sdev
  95.66   96.82   +1.21%       15.14   13.47  -11.03%
  95.73   96.88   +1.20%       15.31   13.94   -8.95%
  97.07   96.30   -0.79%       11.43   14.42  +26.16%
  95.32   95.68   +0.38%       16.78   15.08  -10.13%
  95.55   96.42   +0.91%       15.67   14.02  -10.53%
spam mean and sdev for all runs
  95.86   96.42   +0.58%       14.99   14.20   -5.27%
ham/spam mean difference: 95.77 96.30 +0.53
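For anyone checking the arithmetic, the won/lost percentages in the summary lines are consistent with a plain relative change, (after - before) / before. The following is a minimal sketch of that calculation, not the comparison script's actual code; the function names are made up for illustration.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent; None if before is 0."""
    if before == 0:
        return None  # a change from zero cannot be expressed as a percentage
    return (after - before) * 100.0 / before


def verdict(before, after):
    """Lower error counts are better, so a drop counts as a win."""
    if after < before:
        return "won"
    if after > before:
        return "lost"
    return "tied"


# Totals quoted above: unique fn went from 23 to 20, unique fp from 1 to 5.
for label, before, after in [("fn", 23, 20), ("fp", 1, 5)]:
    print(label, verdict(before, after), "%+.2f%%" % percent_change(before, after))
```

The zero case is handled separately because the per-run lines above report "+(was 0)" rather than a percentage when the first figure is 0.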
I'm not set up to use the tte.py script to get lengths the way Skip did, but
I set timcv.py to save the classifiers, and got these numbers:
token length     ham    spam
           3       0       1
           4       6      16
           5     445     201
           6    4910     277
           7    7909     441
           8    3883     271
           9    1205     227
          10     246      87
          11     131      12
          12      24       1
There's about 12 times as much ham as spam in this corpus, of course, so I
don't know that the raw counts mean much.
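One way to take the imbalance out of the picture is to turn each column into a share of its own total, so ham and spam can be compared row by row. This is only a sketch using the counts from the table above; the variable names are illustrative and nothing here comes from the SpamBayes scripts.

```python
# Counts copied from the table above: length value -> (ham, spam).
counts = {
    3: (0, 1), 4: (6, 16), 5: (445, 201), 6: (4910, 277), 7: (7909, 441),
    8: (3883, 271), 9: (1205, 227), 10: (246, 87), 11: (131, 12), 12: (24, 1),
}

ham_total = sum(ham for ham, spam in counts.values())
spam_total = sum(spam for ham, spam in counts.values())

# Share of each class that falls at each length value.
ham_share = {n: ham / float(ham_total) for n, (ham, spam) in counts.items()}
spam_share = {n: spam / float(spam_total) for n, (ham, spam) in counts.items()}

print("length   ham%  spam%")
for n in sorted(counts):
    print("%6d  %5.1f  %5.1f" % (n, 100 * ham_share[n], 100 * spam_share[n]))
```

Comparing shares rather than raw counts means a row where ham is 12 times spam reads as "about even" instead of as a strong ham signal.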
With a smaller but more balanced corpus, it's a wash (well, it gets one
fewer unsure):
-> <stat> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 280 hams & 131 spams against 1111 hams & 512 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 277 hams & 128 spams against 1114 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
-> <stat> tested 278 hams & 128 spams against 1113 hams & 515 spams
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
won 0 times
tied 5 times
lost 0 times
total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied
false negative percentages
6.870 6.870 tied
3.125 3.125 tied
7.813 7.813 tied
3.906 3.906 tied
5.469 5.469 tied
won 0 times
tied 5 times
lost 0 times
total unique fn went from 35 to 35 tied
mean fn % went from 5.43654580153 to 5.43654580153 tied
ham mean                     ham sdev
   0.18    0.14  -22.22%        1.77    1.40  -20.90%
   0.01    0.01   +0.00%        0.17    0.11  -35.29%
   0.01    0.01   +0.00%        0.12    0.08  -33.33%
   0.03    0.02  -33.33%        0.39    0.29  -25.64%
   0.28    0.27   -3.57%        3.37    3.27   -2.97%
ham mean and sdev for all runs
   0.10    0.09  -10.00%        1.72    1.60   -6.98%
spam mean                    spam sdev
  88.65   88.68   +0.03%       25.58   25.49   -0.35%
  89.82   89.65   -0.19%       23.25   23.49   +1.03%
  87.20   87.22   +0.02%       28.97   28.92   -0.17%
  90.75   90.79   +0.04%       23.91   23.85   -0.25%
  90.28   90.34   +0.07%       25.98   25.96   -0.08%
spam mean and sdev for all runs
  89.34   89.33   -0.01%       25.65   25.64   -0.04%
ham/spam mean difference: 89.24 89.24 -0.00
The token counts for this corpus are:
token length     ham    spam
           3       1       0
           4      11       0
           5      53       2
           6     221     266
           7     297      88
           8     213      94
           9     115      50
          10     168       3
          11      16       8
          12       5       0
          13       8       0
13 8 0
The ratio here is about 2 ham to 1 spam, so, accounting for that, it looks
like (for this corpus, anyway), ham varies in size a lot more than spam.
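A rough way to put a number on "varies more" is the count-weighted standard deviation of the length value in each column, which does not depend on how many messages each class has. This is a sketch under the assumption that the length value can be treated as an ordinary number; the helper name is invented for the example.

```python
from math import sqrt

# Counts copied from the smaller-corpus table above: length value -> (ham, spam).
counts = {
    3: (1, 0), 4: (11, 0), 5: (53, 2), 6: (221, 266), 7: (297, 88),
    8: (213, 94), 9: (115, 50), 10: (168, 3), 11: (16, 8), 12: (5, 0),
    13: (8, 0),
}


def weighted_mean_sdev(pairs):
    """Mean and population standard deviation of values weighted by counts."""
    total = float(sum(weight for value, weight in pairs))
    mean = sum(value * weight for value, weight in pairs) / total
    variance = sum(weight * (value - mean) ** 2 for value, weight in pairs) / total
    return mean, sqrt(variance)


ham_mean, ham_sdev = weighted_mean_sdev([(n, ham) for n, (ham, spam) in counts.items()])
spam_mean, spam_sdev = weighted_mean_sdev([(n, spam) for n, (ham, spam) in counts.items()])

print("ham:  mean %.2f  sdev %.2f" % (ham_mean, ham_sdev))
print("spam: mean %.2f  sdev %.2f" % (spam_mean, spam_sdev))
```

On these counts the ham standard deviation comes out larger than the spam one, which agrees with the reading above.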
=Tony Meyer