[spambayes-dev] RE: [Spambayes] SpamBayes old and new
Tim Peters
tim.one at comcast.net
Thu Jan 8 16:29:34 EST 2004
[followups to spambayes-dev at python.org please, since it would get
increasingly technical beyond this point]
[Simone Piunno]
>> Just out of curiosity, I've read this essay by Greg Louis:
>>
>> http://www.bgl.nu/bogofilter/bayes.html
>>
>> It makes some interesting points about the balance problem.
>> Were you already aware of this essay? Have you tried its approach?
[Tim Peters]
> ...
> Alex here did a relevant experiment, but the report is lacking some
> needed detail:
http://mail.python.org/pipermail/spambayes-dev/2003-November/001592.html
I ran a test on my own recent email mix, using current Outlook addin
defaults. "base" is the current code. "bycount" replaces one line in
classifier.py, from

    prob = spamratio / (hamratio + spamratio)

to

    prob = float(spamcount) / (spamcount + hamcount)
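For readers without the source handy, here's a simplified, self-contained
sketch of the computation that line sits in (paraphrased from
classifier.py, not verbatim; S and X stand in for the unknown_word_strength
and unknown_word_prob options, shown with their usual defaults):

    S = 0.45   # how strongly the prior X resists sparse evidence
    X = 0.5    # spamprob assumed for a never-before-seen word

    def spamprob(hamcount, spamcount, nham, nspam):
        # assumes the word was seen at least once, so the sums are nonzero
        hamratio = hamcount / float(nham)
        spamratio = spamcount / float(nspam)
        # The line the experiment changes -- "base" vs "bycount":
        prob = spamratio / (hamratio + spamratio)            # base
        # prob = float(spamcount) / (spamcount + hamcount)   # bycount
        # Robinson-style smoothing pulls rarely-seen words toward X.
        n = hamcount + spamcount
        return (S * X + n * prob) / (S + n)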
Results are certainly ... remarkable. Since my incoming email is naturally
unbalanced in a 4::1 ham::spam ratio lately, it's a more interesting test
than Greg's nearly-balanced test:
base -> bycount
-> <stat> tested 528 hams & 130 spams against 4752 hams & 1170 spams
<19 repetitions deleted>
false positive percentages
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.000 0.000 tied
0.189 0.000 won -100.00%
0.189 0.000 won -100.00%
0.379 0.000 won -100.00%
0.000 0.000 tied
0.000 0.000 tied
won 3 times
tied 7 times
lost 0 times
total unique fp went from 4 to 0 won -100.00%
mean fp % went from 0.0757575757576 to 0.0 won -100.00%
false negative percentages
0.769 16.154 lost +2000.65%
0.769 23.077 lost +2900.91%
0.000 19.231 lost +(was 0)
0.769 23.077 lost +2900.91%
0.769 23.846 lost +3000.91%
0.769 16.154 lost +2000.65%
1.538 26.923 lost +1650.52%
0.000 12.308 lost +(was 0)
1.538 20.000 lost +1200.39%
1.538 17.692 lost +1050.33%
won 0 times
tied 0 times
lost 10 times
total unique fn went from 11 to 258 lost +2245.45%
mean fn % went from 0.846153846153 to 19.8461538462 lost +2245.45%
ham mean (base bycount change)          ham sdev (base bycount change)
0.38 0.00 -100.00% 3.57 0.00 -100.00%
0.34 0.00 -100.00% 3.70 0.09 -97.57%
0.07 0.00 -100.00% 0.85 0.00 -100.00%
0.03 0.00 -100.00% 0.43 0.00 -100.00%
0.34 0.00 -100.00% 4.08 0.01 -99.75%
0.26 0.00 -100.00% 4.36 0.00 -100.00%
0.28 0.00 -100.00% 4.32 0.00 -100.00%
0.55 0.00 -100.00% 6.44 0.00 -100.00%
0.28 0.00 -100.00% 3.40 0.00 -100.00%
0.29 0.00 -100.00% 3.24 0.00 -100.00%
ham mean and sdev for all runs
0.28 0.00 -100.00% 3.81 0.03 -99.21%
spam mean (base bycount change)         spam sdev (base bycount change)
96.12 63.99 -33.43% 14.01 32.86 +134.55%
97.15 58.20 -40.09% 12.56 35.04 +178.98%
97.58 58.34 -40.21% 8.75 34.93 +299.20%
97.72 58.61 -40.02% 10.38 36.75 +254.05%
97.07 57.33 -40.94% 11.68 35.77 +206.25%
97.00 61.26 -36.85% 13.01 33.07 +154.19%
95.36 55.46 -41.84% 15.45 37.77 +144.47%
97.54 67.03 -31.28% 10.86 31.88 +193.55%
96.34 60.80 -36.89% 14.94 34.05 +127.91%
95.81 60.84 -36.50% 14.94 33.66 +125.30%
spam mean and sdev for all runs
96.77 60.19 -37.80% 12.86 34.77 +170.37%
ham/spam mean difference: 96.49 60.19 -36.30
filename: base bycount
ham:spam: 5280:1300 5280:1300
fp total: 4 0
fp %: 0.08 0.00
fn total: 11 258
fn %: 0.85 19.85
unsure t: 101 660
unsure %: 1.53 10.03
real cost: $71.20 $390.00
best cost: $53.00 $147.60
h mean: 0.28 0.00
h sdev: 3.81 0.03
s mean: 96.77 60.19
s sdev: 12.86 34.77
mean diff: 96.49 60.19
k: 5.79 1.73
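For reference, if I have the test driver's conventions right, "real cost"
charges $10 per fp, $1 per fn, and $0.20 per unsure, and "k" is the
distance between the ham and spam means measured in summed-sdev units; a
quick check reproduces the figures above:

    def real_cost(fp, fn, unsure):
        # assumed driver weights: fp $10, fn $1, unsure $0.20
        return 10.0 * fp + 1.0 * fn + 0.2 * unsure

    def k(ham_mean, ham_sdev, spam_mean, spam_sdev):
        # mean separation in units of the summed standard deviations
        return (spam_mean - ham_mean) / (ham_sdev + spam_sdev)

    print(real_cost(4, 11, 101))          # 71.2   (base)
    print(real_cost(0, 258, 660))         # 390.0  (bycount)
    print(k(0.28, 3.81, 96.77, 12.86))    # ~5.79  (base)
    print(k(0.0, 0.03, 60.19, 34.77))     # ~1.73  (bycount)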
Overall, since I have a lot more ham than spam now, computing the initial
spamprob guess from raw counts instead of corpus-relative ratios makes
everything look hammier; if I had a lot more spam than ham, everything
would look spammier instead. Because everything looks hammier, the ham and
spam means both plummet, the spam variance skyrockets, there are fewer
false positives, almost-astonishingly more false negatives, and about half
the spam scored as unsure:
Ham: 5280 (100.00%) ok, 0 (0.00%) unsure, 0 (0.00%) fp
Spam: 382 (29.38%) ok, 660 (50.77%) unsure, 258 (19.85%) fn
Every ham was classed as ham (no FP, no unsures), but that was at the
expense of only 30% of the spam getting classed as spam, and 20% of it
getting classed as ham.
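To make the "hammier" effect concrete, here's a toy computation with a
hypothetical word (nham and nspam match one training fold above). A word
seen 8 times in each corpus is about 4x more common per spam message than
per ham message, so the ratio form calls it spammy while the raw-count
form calls it dead neutral:

    nham, nspam = 4752, 1170    # one training fold's totals, about 4::1
    hamcount = spamcount = 8    # hypothetical word, equal raw counts

    by_ratio = (spamcount / float(nspam)) / (
        hamcount / float(nham) + spamcount / float(nspam))
    by_count = float(spamcount) / (spamcount + hamcount)

    print(by_ratio)   # ~0.80: spammy, 4x more frequent per spam
    print(by_count)   # 0.50: neutral, raw counts hide the imbalance

Multiply that shift across every word in a message and spam scores slide
down toward (and past) the unsure range, which is exactly what the spam
means and unsure counts above show.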
So, in all, this experiment agreed with what Alex reported earlier:
> basing the prob on the raw counts instead of the ratios is
> an incredibly clearcut loss. Only won twice on the false positives
> (by relatively small margins), but lost EVERY time on the false
> negatives by large amounts.
I should note that this test was run against *all* the email I've received
recently, so the ham::spam ratio used in the test is the same one I see in
real life.