[Spambayes] Getting rid of max_spamprob and min_spamprob
Neil Schemenauer
nas@python.ca
Sun, 15 Sep 2002 16:13:03 -0700
I don't like the max_spamprob and min_spamprob limits. I've written a
version of spamprob() that uses long integers, does not clamp the
probabilities and uses all evidence.

    def spamprob(self, wordstream, evidence=False):
        wordinfoget = self.wordinfo.get
        # Accumulate the ham:spam odds as an exact ratio of long integers.
        numerator = denominator = 1L
        nham = self.nham
        nspam = self.nspam
        for word in Set(wordstream):
            record = wordinfoget(word)
            if record is None:
                continue
            hamcount = record.hamcount
            spamcount = record.spamcount
            if hamcount == 0:
                # Word seen only in spam.
                numerator *= nspam
                denominator *= (nham + 1) * spamcount
            elif spamcount == 0:
                # Word seen only in ham.
                numerator *= (nspam + 1) * hamcount
                denominator *= nham
            else:
                numerator *= nspam * hamcount
                denominator *= nham * spamcount
        # divmod gives the integer quotient and the integer remainder.
        real, frac = divmod(numerator, denominator)
        huge = 1L<<30
        if real > 0:
            if real > huge:
                prob = 0.0
            else:
                prob = 1.0 / (real + 1.0)
        else:
            if frac > huge:
                prob = 1.0
            else:
                prob = frac / (1.0 + frac)
        if evidence:
            # No per-word evidence list is collected in this version.
            return (prob, [])
        else:
            return prob

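For readers on a current Python, here is a minimal sketch of the same unclamped, all-evidence idea using `fractions.Fraction` in place of hand-managed long integers. The `WordInfo` tuple and the standalone function signatures are assumptions made for illustration, not part of the classifier. It also assumes `nham` and `nspam` are positive and that every record has at least one nonzero count. The sketch evaluates 1 / (1 + odds) on the exact fraction, which is not the same final step as the divmod-based code above, so scores for individual messages need not match the posted numbers.

```python
from collections import namedtuple
from fractions import Fraction

# Hypothetical stand-in for the classifier's per-word record.
WordInfo = namedtuple("WordInfo", "hamcount spamcount")


def ham_spam_odds(words, wordinfo, nham, nspam):
    """Exact ham:spam odds over every known word, with no clamping."""
    odds = Fraction(1)
    for word in set(words):
        record = wordinfo.get(word)
        if record is None:
            continue  # unknown words contribute no evidence
        if record.hamcount == 0:
            # Word seen only in spam.
            odds *= Fraction(nspam, (nham + 1) * record.spamcount)
        elif record.spamcount == 0:
            # Word seen only in ham.
            odds *= Fraction((nspam + 1) * record.hamcount, nham)
        else:
            odds *= Fraction(nspam * record.hamcount, nham * record.spamcount)
    return odds


def spamprob(words, wordinfo, nham, nspam):
    """Convert the exact odds to a float probability of spam."""
    odds = ham_spam_odds(words, wordinfo, nham, nspam)
    return float(1 / (1 + odds))
```

Because the odds stay an exact rational until the last line, no word's contribution is lost to floating-point underflow, however many words the message contains.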
The results are interesting, IMHO. First the rate summary:
total unique false pos 113
total unique false neg 0
average fp % 6.27777777778
average fn % 0.0
The fp rate sucks but the fn rate is great. Here are the histograms for
all runs:
Ham distribution for all runs:
* = 28 items
0.00 1668 ************************************************************
2.50 7 *
5.00 0
7.50 3 *
10.00 0
12.50 0
15.00 0
17.50 0
20.00 1 *
22.50 0
25.00 3 *
27.50 0
30.00 0
32.50 2 *
35.00 0
37.50 0
40.00 0
42.50 0
45.00 0
47.50 0
50.00 3 *
52.50 0
55.00 0
57.50 0
60.00 0
62.50 0
65.00 0
67.50 0
70.00 0
72.50 0
75.00 0
77.50 0
80.00 0
82.50 0
85.00 0
87.50 0
90.00 0
92.50 0
95.00 0
97.50 113 *****
Spam distribution for all runs:
* = 30 items
0.00 0
2.50 0
5.00 0
7.50 0
10.00 0
12.50 0
15.00 0
17.50 0
20.00 0
22.50 0
25.00 0
27.50 0
30.00 0
32.50 0
35.00 0
37.50 0
40.00 0
42.50 0
45.00 0
47.50 0
50.00 0
52.50 0
55.00 0
57.50 0
60.00 0
62.50 0
65.00 0
67.50 0
70.00 0
72.50 0
75.00 0
77.50 0
80.00 0
82.50 0
85.00 0
87.50 0
90.00 0
92.50 0
95.00 0
97.50 1800 ************************************************************
Perhaps there is some way we can swap the two rates by introducing some
bias.
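One possible reading of "bias", offered purely as an assumption: scale the ham:spam odds by a constant before converting them to a probability, so that a message needs stronger spam evidence to reach a high score. The names `biased_spamprob` and `ham_bias` are hypothetical, and whether any such factor would actually trade false positives for false negatives on this corpus is untested.

```python
from fractions import Fraction


def biased_spamprob(odds, ham_bias=Fraction(10)):
    """Scale the ham:spam odds by ham_bias, then convert to P(spam).

    A ham_bias above 1 pushes every score toward ham; a ham_bias of 1
    leaves the score unchanged.
    """
    return float(1 / (1 + Fraction(odds) * ham_bias))
```

For example, odds of 1/3 give a score of 0.75 with no bias and 0.5 with `ham_bias=3`.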
Neil