# [Spambayes] Getting rid of max_spamprob and min_spamprob

Neil Schemenauer nas@python.ca
Sun, 15 Sep 2002 16:13:03 -0700

```
I don't like the max_spamprob and min_spamprob limits.  I've written a
version of spamprob() that uses long integers, does not clamp the
probabilities and uses all evidence.

    def spamprob(self, wordstream, evidence=False):
        # Set is assumed to come from the `sets` module (from sets import Set).
        wordinfoget = self.wordinfo.get
        # numerator/denominator accumulates the exact ham:spam odds as longs.
        numerator = denominator = 1L
        nham = self.nham
        nspam = self.nspam
        for word in Set(wordstream):
            record = wordinfoget(word)
            if record is None:
                continue
            hamcount = record.hamcount
            spamcount = record.spamcount
            if hamcount == 0:
                numerator *= nspam
                denominator *= (nham + 1) * spamcount
            elif spamcount == 0:
                numerator *= (nspam + 1) * hamcount
                denominator *= nham
            else:
                numerator *= nspam * hamcount
                denominator *= nham * spamcount
        # divmod() returns (quotient, remainder), so frac is the integer
        # remainder of the division, not a fractional part.
        real, frac = divmod(numerator, denominator)
        huge = 1L<<30
        if real > 0:
            if real > huge:
                prob = 0.0
            else:
                prob = 1.0 / (real + 1.0)
        else:
            if frac > huge:
                prob = 1.0
            else:
                prob = frac / (1.0 + frac)
        if evidence:
            return (prob, [])
        else:
            return prob

The results are interesting, IMHO.  First the rate summary:

total unique false pos 113
total unique false neg 0
average fp % 6.27777777778
average fn % 0.0

The fp rate sucks but the fn rate is great.  Here are the histograms for
all runs:

Ham distribution for all runs:
* = 28 items
 0.00 1668 ************************************************************
 2.50    7 *
 5.00    0
 7.50    3 *
10.00    0
12.50    0
15.00    0
17.50    0
20.00    1 *
22.50    0
25.00    3 *
27.50    0
30.00    0
32.50    2 *
35.00    0
37.50    0
40.00    0
42.50    0
45.00    0
47.50    0
50.00    3 *
52.50    0
55.00    0
57.50    0
60.00    0
62.50    0
65.00    0
67.50    0
70.00    0
72.50    0
75.00    0
77.50    0
80.00    0
82.50    0
85.00    0
87.50    0
90.00    0
92.50    0
95.00    0
97.50  113 *****

Spam distribution for all runs:
* = 30 items
 0.00    0
 2.50    0
 5.00    0
 7.50    0
10.00    0
12.50    0
15.00    0
17.50    0
20.00    0
22.50    0
25.00    0
27.50    0
30.00    0
32.50    0
35.00    0
37.50    0
40.00    0
42.50    0
45.00    0
47.50    0
50.00    0
52.50    0
55.00    0
57.50    0
60.00    0
62.50    0
65.00    0
67.50    0
70.00    0
72.50    0
75.00    0
77.50    0
80.00    0
82.50    0
85.00    0
87.50    0
90.00    0
92.50    0
95.00    0
97.50 1800 ************************************************************

Perhaps there is some way we can swap the two rates by introducing some
bias.

Neil

```
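
For readers who want to experiment with the idea, here is a minimal sketch of the same unclamped combination in present-day Python 3. It is not the code that produced the numbers above: it keeps the odds as an exact `fractions.Fraction` all the way through instead of splitting them with `divmod`, and the names `WordRecord`, `ham_odds`, `spam_probability` and the `bias` parameter are hypothetical. The `bias` multiplier is only one guess at what "introducing some bias" could look like; the post does not specify a mechanism.

```python
from fractions import Fraction
from typing import Dict, Iterable, NamedTuple


class WordRecord(NamedTuple):
    """Hypothetical stand-in for the classifier's per-word counts."""
    hamcount: int
    spamcount: int


def ham_odds(words: Iterable[str], wordinfo: Dict[str, WordRecord],
             nham: int, nspam: int) -> Fraction:
    """Exact ham:spam odds, using the same per-word factors as the post.

    Assumes nham > 0 and that no record has both counts equal to zero.
    """
    odds = Fraction(1)
    for word in set(words):          # each distinct word counts once
        record = wordinfo.get(word)
        if record is None:           # unknown words carry no evidence
            continue
        if record.hamcount == 0:
            odds *= Fraction(nspam, (nham + 1) * record.spamcount)
        elif record.spamcount == 0:
            odds *= Fraction((nspam + 1) * record.hamcount, nham)
        else:
            odds *= Fraction(nspam * record.hamcount,
                             nham * record.spamcount)
    return odds


def spam_probability(odds: Fraction, bias: Fraction = Fraction(1)) -> float:
    """Map ham odds to a spam probability, 1 / (1 + bias * odds).

    bias > 1 pushes scores toward ham (an assumed knob, not from the post).
    """
    return float(1 / (1 + bias * odds))
```

With `nham = nspam = 10` and a word seen in 2 ham and 6 spam messages, the odds are 1/3 and the spam probability is 0.75; a `bias` of 3 moves that same message to 0.5.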