[Spambayes] realization of mistakes...

Tim Peters tim.one@comcast.net
Sat, 28 Sep 2002 17:51:51 -0400


[Josiah Carlson]
> I've finally had a chance to go through the current spambayes code in
> cvs.  I notice that the project looks to have abandoned Paul Graham's
> ideas entirely, as I could not find anything that looked remotely like
> what was offered in his paper.

Read the first 20 lines of classifier.py; there was a "death match" between
the schemes just this week, and the current scheme won.

> I suppose this is not necessarily a good thing or a bad thing,

For this project, it's necessarily a good thing.  Read TESTING.txt:  the
things we do here are driven by the results of solid statistical testing.
Things that don't win are ruthlessly purged, and we have high confidence
that the things that survive are genuine improvements (which doesn't mean
they always are -- prior decisions are revisited endlessly too, for that
reason).

> but through recent developments in pasp (in the last 24 hours), I've
> noticed that it is good to have multiple classifiers to compare to one
> another.   Considering the experimental status of statistical email
> categorization, I believe that the more information one can have about
> the performance of varying algorithms, the better.  But that could be
> just me.

Nope, that's a solid idea.  A difficulty is that solid testing takes time,
there are hundreds of things that need testing, and at any given time I push
the core algorithms toward what I *guess* has the highest likelihood of
paying off next.  At the moment, that's Gary's "central limit" algorithms
(which aren't mentioned on his web pages, but are discussed in this list's
archives).

> On that note, I've looked through the Gary Robinson implementation that
> is in CVS.  It looks like you have done a complete implementation of
> those algorithms.  I should note that the numbers I gave earlier for my
> Gary Robinson algorithm came from what was basically a drop-in replacement
> of Paul Graham's combining formula (converted for comparison's sake):
> P = ((1-p1)*(1-p2)*...*(1-pn))
> Q = (p1*p2*...*pn)
> S = Q / ( P + Q )
>
> to Gary Robinson's:
> P = 1 - ((1-p1)*(1-p2)*...*(1-pn))^(1/n)
> Q = 1 - (p1*p2*...*pn)^(1/n)
> S = .5 +  ((P - Q) / (P + Q)) / 2
>
> I apologize for any confusion as to why the Gary Robinson version I use
> doesn't do so well; it's because I haven't added any of the "Further
> Improvements", which apparently do a lot.

To use Gary's scheme it's best to purge all the artificial biases from
Paul's scheme.  For example, min and max probs of 0.01 and 0.99 must go,
artificially multiplying ham counts by 2 must go, ignoring words until
2*hamcount + spamcount >= 5 must go, and just over the last day we're
getting evidence that it's *also* good in Gary's scheme to remove the
asymmetry in Paul's scheme wrt how many times to count a word that appears
multiple times in a msg (Paul counts such a beast only once when scoring,
but multiple times when training; in effect, it's Yet Another Bias that, in
his full scheme, appears to partly cancel out the "let's ignore words that
don't appear often" bias).
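
To make those knobs concrete, here's a sketch of the per-word probability
with and without them.  hamcount/spamcount are per-word training counts and
nham/nspam the number of trained messages; the names and the bias parameter
are mine, for illustration only:

    def graham_spamprob(hamcount, spamcount, nham, nspam, bias=2.0):
        # Paul's per-word probability, artificial knobs included:  multiply
        # the ham count by a bias (2 in his paper), ignore words seen too
        # rarely, and clamp the result into [.01, .99].
        g = bias * hamcount
        b = spamcount
        if g + b < 5:
            return None                 # "don't use this word as a clue"
        hamratio = min(1.0, float(g) / nham)
        spamratio = min(1.0, float(b) / nspam)
        prob = spamratio / (hamratio + spamratio)
        return min(0.99, max(0.01, prob))

    def unbiased_spamprob(hamcount, spamcount, nham, nspam):
        # The same ratio with the knobs removed -- much closer to what
        # Gary's math wants to see.  Assumes the word was seen at least
        # once, so the denominator can't be zero.
        hamratio = min(1.0, float(hamcount) / nham)
        spamratio = min(1.0, float(spamcount) / nspam)
        return spamratio / (hamratio + spamratio)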

> I've been contemplating borrowing some classification code from
> classifier.py, but am shying away because I like to keep things simple,

??? Gary's scheme is simpler than Paul's -- it doesn't have a pile of
"mystery knobs" to trip over.  The classifier is quite simple; tokenizer.py
is getting hairy, though.

> mostly because I like to understand what is going on with the math I
> use.

This gets stranger and stranger <wink>.

> Of course that makes comparison between algorithms used and fn/fp
> performance somewhat difficult, but I believe that at least some
> useful information can be gained.

This would be really interesting with a best-effort implementation of both
schemes.  If you follow the suggestion at the top to read the comments,
they'll tell you how to get our last implementation of Paul's scheme; its
performance was very good (and much better than Paul's scheme before we
fiddled with it).

> If anyone is curious, I decided to continue on the thought path that led
> me to alter the bias in Paul Graham's probability function last evening.
>
>           bias
>             |
> (let ((g (* 2 (or (gethash word good) 0)))
>       (b (or (gethash word bad) 0)))
>    (unless (< (+ g b) 5)
>      (max .01
>           (min .99 (float (/ (min 1 (/ b nbad))
>                              (+ (min 1 (/ g ngood))
>                                 (min 1 (/ b nbad)))))))))
>
> Hams: 5145
> Spams: 747
> Test hams: 62
> Test spams: 122
>
> Using the cutoffs [.5, .6, .7, .8, .9, .95] to find ranges of fn/fp.
>
> If you don't feel like reading the ranges: considering that the desirable
> outcome is an fn rate as low as possible and zero fp, on my corpus of
> spam/ham/testspam/testham,

As your corpus grows, you're going to discover that Paul's scheme sometimes
gives a score of exactly 1.0 to ham; at that point, the only way to avoid
false positives is never to call anything spam.  Gary's scheme hasn't been
caught doing that yet, and the people on this list have several corpora with
tens of thousands of msgs.
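
As far as I can tell the mechanism is nothing deeper than double-precision
rounding:  once a ham happens to contain ten or so high-spamprob words, the
all-ham product P becomes so small next to Q that Q / (P + Q) rounds to
exactly 1.0.  For instance:

    # A hypothetical ham that picked up ten 0.99-spamprob clues, scored
    # with Paul's combining rule.
    probs = [0.99] * 10
    P = Q = 1.0
    for p in probs:
        P *= 1.0 - p
        Q *= p
    print(Q / (P + Q))   # 1.0 exactly -- indistinguishable from sure spam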

> it seems that adjusting the per-word probability function bias to some
> number in the range of 1-1.5 produces both a minimum number of false
> negatives and zero false positives with basic Paul Graham.
>
> What is also quite interesting is that just a plain weighted average
> performs (on the most part), somewhere between PaulGraham and
> GaryRobinson.

Note that there's a large literature on how best to combine multiple
schemes.  AdaBoost has been mentioned before, but nobody here has pursued it
yet (it would be interesting in at least two ways:  combining multiple
"whole cloth" schemes, and viewing each spamprob value as an independent
mini-classifier in its own right).
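
The crudest version of that is nothing fancier than a fixed blend of the two
schemes' final scores, which I'm guessing is in the ballpark of what your
weighted average does (the names here are made up):

    def blend_scores(graham_score, robinson_score, weight=0.5):
        # A fixed convex combination of the two whole-cloth schemes' final
        # scores:  weight=1.0 is pure Graham, weight=0.0 pure Robinson.
        # Whether any weight in between actually beats both endpoints is
        # exactly the kind of question only rigorous testing can answer.
        return weight * graham_score + (1.0 - weight) * robinson_score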

> Anyways, I'll stop typing and get around to doing another release with
> what I currently know.
>
>  - Josiah
>
> Bias 2:
> alg   fn range in %   fp range in %
> PG -> [5 - 5]         [0 - 0]
> GR -> [5 - 43]        [0 - 0]
> WA -> [9 - 23]        [0 - 0]
>
> Bias 1.75:
> alg   fn range in %   fp range in %
> PG -> [3 - 3]         [0 - 0]
> GR -> [3 - 43]        [0 - 0]
> WA -> [7 - 20]        [0 - 0]
>
> Bias 1.5:
> alg   fn range in %   fp range in %
> PG -> [2 - 2]         [0 - 0]
> GR -> [2 - 41]        [0 - 0]
> WA -> [4 - 19]        [0 - 0]
>
> Bias 1.25:
> alg   fn range in %   fp range in %
> PG -> [1 - 1]         [0 - 0]
> GR -> [1 - 35]        [0 - 0]
> WA -> [2 - 16]        [0 - 0]
>
> Bias 1:
> alg   fn range in %   fp range in %
> PG -> [1 - 1]         [3 - 3]
> GR -> [1 - 23]        [3 - 0]
> WA -> [2 - 12]        [10 - 0]

For reasons explained in TESTING.txt, results obtained from a single test
set don't carry any weight here, beyond suggesting that it may be worthwhile
to run a rigorous multi-corpora test.  In the universe of all emails, your
test data is a very small sample, and so you can never be sure whether
you're observing improvements in your algorithm, or just observing random
quirks in your data.  You could get much more confidence in your results,
even with this small amount of data, by adapting the cross-validation
testing framework used in mboxtest.py and timcv.py.  That slices-and-dices
your data N ways, training on (N-1)/N of it and predicting the remaining 1/N
of it, N times.  Sometimes after making a change, you'll find that half the
runs win and the other half lose -- in such a case, it's hard to sell that
the change made a real difference in either direction.  But also in such a
case, had you run only one test, you would have had a 1 in 2 chance of
erroneously concluding that the change helped -- or that the change hurt.
We're taking great pains to avoid such traps here.
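
If adapting timcv.py or mboxtest.py directly is more trouble than it's worth
for pasp, even a hand-rolled version of the idea beats a single train/test
split.  Roughly like this -- a sketch only, with a made-up train()/score()
interface standing in for however your classifier is actually driven:

    import random

    def cross_validate(hams, spams, make_classifier, nfolds=10, cutoff=0.9):
        # Split each corpus into nfolds pieces; train on nfolds-1 of them
        # and predict the held-out piece, once per fold.  make_classifier()
        # is assumed to return a fresh object with train(msgs, is_spam) and
        # score(msg) methods -- adjust to taste.
        random.shuffle(hams)
        random.shuffle(spams)
        for fold in range(nfolds):
            held_ham = hams[fold::nfolds]
            held_spam = spams[fold::nfolds]
            train_ham = [m for i, m in enumerate(hams) if i % nfolds != fold]
            train_spam = [m for i, m in enumerate(spams) if i % nfolds != fold]
            c = make_classifier()
            c.train(train_ham, is_spam=False)
            c.train(train_spam, is_spam=True)
            fp = len([m for m in held_ham if c.score(m) >= cutoff])
            fn = len([m for m in held_spam if c.score(m) < cutoff])
            print("fold %d: fp %d/%d  fn %d/%d"
                  % (fold, fp, len(held_ham), fn, len(held_spam)))

Looking at the spread of the per-fold fp/fn numbers tells you far more about
whether a change is real than any single pair of totals can.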