[Spambayes] realization of mistakes...

Josiah Carlson jcarlson@uci.edu
Sat, 28 Sep 2002 11:42:13 -0700


Good day,

I've finally had a chance to go through the current spambayes code in
CVS.  I notice that the project appears to have abandoned Paul Graham's
ideas entirely, as I could not find anything that looks remotely like
what was described in his paper.

I suppose this is not necessarily a good thing or a bad thing, but
working on pasp over the last 24 hours has reminded me that it is
useful to have multiple classifiers to compare against one another.
Given the experimental status of statistical email categorization, I
believe that the more information one can have about the performance
of the various algorithms, the better.  But that could just be me.


On that note, I've looked through the Gary Robinson implementation that
is in CVS.  It looks like you have done a complete implementation of
those algorithms.  I should note that the numbers I gave earlier for my
Gary Robinson algorithm came from what was basically a drop-in
replacement of Paul Graham's combining formula (rewritten here for
comparison's sake):
P = ((1-p1)*(1-p2)*...*(1-pn))
Q = (p1*p2*...*pn)
S = Q / ( P + Q )

to Gary Robinson's:
P = 1 - ((1-p1)*(1-p2)*...*(1-pn))^(1/n)
Q = 1 - (p1*p2*...*pn)^(1/n)
S = .5 +  ((P - Q) / (P + Q)) / 2
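
In case anyone wants to play along at home, both combining rules
translate into Python in the obvious way; this is just a transcription
of the formulas above (the per-word probabilities are assumed to
already be clamped to [.01, .99]):

def graham_combine(probs):
    # S = Q / (P + Q), with P = (1-p1)*...*(1-pn) and Q = p1*...*pn
    P, Q = 1.0, 1.0
    for p in probs:
        P *= 1.0 - p
        Q *= p
    return Q / (P + Q)

def robinson_combine(probs):
    # P = 1 - ((1-p1)*...*(1-pn))^(1/n), Q = 1 - (p1*...*pn)^(1/n)
    # S = .5 + ((P - Q) / (P + Q)) / 2
    n = len(probs)
    P, Q = 1.0, 1.0
    for p in probs:
        P *= 1.0 - p
        Q *= p
    P = 1.0 - P ** (1.0 / n)
    Q = 1.0 - Q ** (1.0 / n)
    return 0.5 + ((P - Q) / (P + Q)) / 2.0

(For long messages the raw products underflow, so in practice you want
to sum logs and exponentiate at the end, but the above matches the
formulas as written.)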

I apologize for any confusion as to why the Gary Robinson version I use
doesn't do so well: it's because I haven't added any of the "Further
Improvements", which apparently make quite a difference (a sketch of
the main piece I'm leaving out follows this paragraph).  I've been
contemplating borrowing some classification code from classifier.py,
but am shying away because I like to keep things simple, mostly because
I like to understand the math I'm using.  Of course that makes
comparing the fn/fp performance of the algorithms somewhat difficult,
but I believe that at least some useful information can be gained.
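
For anyone who hasn't read Robinson's essay: as I understand it, the
main piece of those "Further Improvements" is a smoothed per-word
probability that pulls each raw word probability toward an assumed
prior when the word has only been seen a few times.  Roughly (the
function name and defaults are mine, not pasp or spambayes code):

def robinson_smoothed_prob(p, n, s=1.0, x=0.5):
    # p: raw per-word spam probability
    # n: number of messages the word has appeared in
    # x: assumed probability for a word we know nothing about
    # s: how much weight to give that assumption
    # f(w) = (s*x + n*p) / (s + n)
    return (s * x + n * p) / (s + n)

None of this is in my numbers below.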


If anyone is curious, I decided to continue down the line of thought
that led me to alter the bias in Paul Graham's per-word probability
function last evening:
          bias
            |
(let ((g (* 2 (or (gethash word good) 0)))
      (b (or (gethash word bad) 0)))
   (unless (< (+ g b) 5)
     (max .01
          (min .99 (float (/ (min 1 (/ b nbad))
                             (+ (min 1 (/ g ngood))   
                                (min 1 (/ b nbad)))))))))
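
For the non-Lisp readers, here is roughly the same thing in Python with
the bias pulled out as a parameter (good/bad are per-word count tables
and ngood/nbad the corpus sizes; the names are mine, not Graham's):

def word_prob(word, good, bad, ngood, nbad, bias=2.0):
    # 'bias' plays the role of the hard-coded 2 above: it inflates the
    # word's ham count before anything else happens.
    g = bias * good.get(word, 0)
    b = bad.get(word, 0)
    if g + b < 5:
        return None          # too little data; the word is ignored
    spam_ratio = min(1.0, b / nbad)
    ham_ratio = min(1.0, g / ngood)
    return max(0.01, min(0.99, spam_ratio / (ham_ratio + spam_ratio)))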

Hams: 5145
Spams: 747
Test hams: 62
Test spams: 122

I used the cutoffs [.5, .6, .7, .8, .9, .95] to find the ranges of
fn/fp given at the end of this message.
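
To be explicit about what the ranges mean: for each cutoff I count the
test spams that score below it (false negatives) and the test hams that
score at or above it (false positives), as percentages of the test
sets, and the ranges run from the value at the lowest cutoff to the
value at the highest.  Something along these lines, sketched rather
than copied from pasp:

def fn_fp_ranges(spam_scores, ham_scores,
                 cutoffs=(.5, .6, .7, .8, .9, .95)):
    # fn%: spams scoring below the cutoff; fp%: hams at or above it
    fns = [100.0 * sum(s < c for s in spam_scores) / len(spam_scores)
           for c in cutoffs]
    fps = [100.0 * sum(h >= c for h in ham_scores) / len(ham_scores)
           for c in cutoffs]
    return (fns[0], fns[-1]), (fps[0], fps[-1])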

If you don't feel like reading through the ranges: given that the
desirable outcome is as few false negatives as possible with zero false
positives, on my corpus of spam/ham/test spam/test ham it seems that
setting the per-word probability bias to something in the range of 1 to
1.5 produces both the minimum number of false negatives and zero false
positives with basic Paul Graham.

What is also quite interesting is that a plain weighted average
performs, for the most part, somewhere between Paul Graham and Gary
Robinson.

Anyways, I'll stop typing and get around to doing another release with
what I currently know.

 - Josiah

(PG = Paul Graham, GR = Gary Robinson, WA = weighted average)

Bias 2:
alg   fn range in %   fp range in %
PG -> [5 - 5]         [0 - 0]
GR -> [5 - 43]        [0 - 0]
WA -> [9 - 23]        [0 - 0]

Bias 1.75:
alg   fn range in %   fp range in %
PG -> [3 - 3]         [0 - 0]
GR -> [3 - 43]        [0 - 0]
WA -> [7 - 20]        [0 - 0]

Bias 1.5:
alg   fn range in %   fp range in %
PG -> [2 - 2]         [0 - 0]
GR -> [2 - 41]        [0 - 0]
WA -> [4 - 19]        [0 - 0]

Bias 1.25:
alg   fn range in %   fp range in %
PG -> [1 - 1]         [0 - 0]
GR -> [1 - 35]        [0 - 0]
WA -> [2 - 16]        [0 - 0]

Bias 1:
alg   fn range in %   fp range in %
PG -> [1 - 1]         [3 - 3]
GR -> [1 - 23]        [3 - 0]
WA -> [2 - 12]        [10 - 0]