[Spambayes] New tokenization of the Subject line
Remi Ricard
papaDoc@videotron.ca
Fri, 04 Oct 2002 21:26:25 -0400
Hi,
I tried something again.
Most of the mail from subscribed groups has a tag like [spambayes] or
[freesco] in the subject, i.e. with "[" and "]".
I decided to keep the brackets as part of the word, so the words from a
subject line like: Re: [Spambayes] Moving closer to Gary's ideal
will be
Re:
[Spambayes]
Moving
closer
to
Gary's
ideal
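
As a rough illustration of the split described above, here is a minimal sketch in Python. It assumes the subject is simply split on whitespace so that "[", "]", ":" and "'" stay attached to their words. The function name tokenize_subject is hypothetical and this is not the actual Spambayes tokenizer code.

```python
def tokenize_subject(subject):
    """Split a Subject line on whitespace only.

    Punctuation such as '[' and ']' is left attached to the word,
    so '[Spambayes]' survives as a single token.
    (Hypothetical helper, not the real Spambayes tokenizer.)
    """
    return subject.split()


if __name__ == "__main__":
    subject = "Re: [Spambayes] Moving closer to Gary's ideal"
    for token in tokenize_subject(subject):
        print(token)
```

Splitting on whitespace alone keeps the list tag distinct from the bare word "Spambayes", which is the point of the experiment.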
And this is the result.
-> <stat> tested 200 hams & 279 spams against 800 hams & 1113 spams
-> <stat> tested 200 hams & 275 spams against 800 hams & 1117 spams
-> <stat> tested 200 hams & 298 spams against 800 hams & 1094 spams
-> <stat> tested 200 hams & 272 spams against 800 hams & 1120 spams
-> <stat> tested 200 hams & 268 spams against 800 hams & 1124 spams
-> <stat> tested 200 hams & 279 spams against 800 hams & 1113 spams
-> <stat> tested 200 hams & 275 spams against 800 hams & 1117 spams
-> <stat> tested 200 hams & 298 spams against 800 hams & 1094 spams
-> <stat> tested 200 hams & 272 spams against 800 hams & 1120 spams
-> <stat> tested 200 hams & 268 spams against 800 hams & 1124 spams
false positive percentages
1.000 0.500 won -50.00%
1.500 1.500 tied
2.000 2.500 lost +25.00%
1.000 1.000 tied
0.000 0.000 tied
won 1 times
tied 3 times
lost 1 times
total unique fp went from 11 to 11 tied
mean fp % went from 1.1 to 1.1 tied
false negative percentages
0.717 0.717 tied
0.727 0.727 tied
1.007 1.342 lost +33.27%
0.000 0.368 lost +(was 0)
0.746 0.373 won -50.00%
won 1 times
tied 2 times
lost 2 times
total unique fn went from 9 to 10 lost +11.11%
mean fn % went from 0.639419734305 to 0.705436374356 lost +10.32%
ham mean                     ham sdev
  24.51   25.20   +2.82%        9.45    9.09   -3.81%
  26.14   27.20   +4.06%        8.62    8.32   -3.48%
  26.04   26.94   +3.46%       10.00    9.68   -3.20%
  25.15   25.85   +2.78%        8.05    7.93   -1.49%
  25.12   26.11   +3.94%        8.28    8.16   -1.45%
ham mean and sdev for all runs
  25.39   26.26   +3.43%        8.93    8.69   -2.69%
spam mean                    spam sdev
  80.41   79.86   -0.68%        8.80    8.81   +0.11%
  79.87   79.47   -0.50%        8.20    8.11   -1.10%
  79.87   79.31   -0.70%        8.79    8.73   -0.68%
  80.42   80.03   -0.48%        8.13    8.22   +1.11%
  80.11   79.70   -0.51%        9.32    9.07   -2.68%
spam mean and sdev for all runs
  80.13   79.66   -0.59%        8.66    8.60   -0.69%
ham/spam mean difference: 54.74 53.40 -1.34
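
For anyone checking the arithmetic in the listing above, here is a small Python sketch of how the numbers appear to fit together. It assumes each row shows a "before" value, an "after" value, and the relative change between them, and that an error percentage is the number of misclassified messages divided by the number tested (for example, 2 of 200 hams gives 1.000). The function names are hypothetical and this is not the comparison script's own code.

```python
def error_rate(errors, total):
    """Percentage of messages misclassified (assumed definition)."""
    return errors / total * 100.0


def percent_change(before, after):
    """Relative change from before to after, in percent.

    Returns None when before is 0, where the listing prints
    '+(was 0)' instead of a percentage.
    """
    if before == 0:
        return None
    return (after - before) / before * 100.0


def verdict(before, after):
    """For error rates lower is better, so a drop counts as 'won'."""
    if after < before:
        return "won"
    if after > before:
        return "lost"
    return "tied"


if __name__ == "__main__":
    # First false positive row: 1.000 0.500 won -50.00%
    print(verdict(1.000, 0.500), percent_change(1.000, 0.500))
    # Third false negative row: 1.007 1.342 lost +33.27%
    print(verdict(1.007, 1.342), round(percent_change(1.007, 1.342), 2))
    # 4 misclassified spams out of 298 tested
    print(round(error_rate(4, 298), 3))
    # ham/spam mean difference, before and after
    print(round(80.13 - 25.39, 2), round(79.66 - 26.26, 2))
```

Read this way, the last line of the listing says the gap between the mean spam score and the mean ham score shrank from 54.74 to 53.40.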
I'm still having problems reading the results. Can someone
explain them a little bit?
My knowledge of statistics comes from a course I took
almost 15 years ago, and it was the only course I managed
to fall asleep in..... even though I like math (I did a
B.Sc. in physics).
papaDoc