[Spambayes] realization of mistakes...

Josiah Carlson jcarlson@uci.edu
Sat, 28 Sep 2002 17:36:14 -0700


[Tim Peters]
> Read the first 20 lines of classifier.py; there was a "death match" between
> the schemes just this week, and the current scheme won.

Ahh, a death match.  I'm looking through the different versions of
classifier.py in the archives to find all the tweaks.

> > I suppose this is not necessarily a good thing or a bad thing,
> 
> For this project, it's necessarily a good thing.  Read TESTING.txt:  the
> things we do here are driven by the results of solid statistical testing.
> Things that don't win are ruthlessly purged, and we have high confidence
> that the things that survive are genuine improvements (which doesn't mean
> they always are -- prior decisions are revisited endlessly too, for that
> reason).

I just read TESTING.txt.  And I agree that one should move on, but
ruthlessly purging old code, while good for keeping bloat low or
nonexistent, only makes sense if the code is bloated already.  Never be
afraid to just comment out the code.

> > but through recent developments in pasp (in the last 24 hours), I've
> > noticed that it is good to have multiple classifiers to compare to one
> > another.   Considering the experimental status of statistical email
> > categorization, I believe that the more information one can have about
> > the performance of varying algorithms, the better.  But that could be
> > just me.
> 
> Nope, that's a solid idea.  A difficulty is that solid testing takes time,
> there are hundreds of things that need testing, and at any given time I push
> the core algorithms toward what I *guess* has the highest likelihood of
> paying off next.  At the moment, that's Gary's "central limit" algorithms
> (which aren't mentioned on his web pages, but are discussed in this list's
> archives).

That is a legitimate case for chucking old stuff.  It's better not to
have to test all the old code that may still work well once you've
moved on, especially when you already have too much to test.

> > On that note, I've looked through the Gary Robinson implementation that
> > is in CVS.  It looks like you have done a complete implementation of
> > those algorithms.  I should note that those numbers I gave earlier for
> > my Gary Robinson algorithm were basically from a drop-in replacement of
> > Paul Graham's (converted for comparison's sake):
> > P = ((1-p1)*(1-p2)*...*(1-pn))
> > Q = (p1*p2*...*pn)
> > S = Q / ( P + Q )
> >
> > to Gary Robinson's:
> > P = 1 - ((1-p1)*(1-p2)*...*(1-pn))^(1/n)
> > Q = 1 - (p1*p2*...*pn)^(1/n)
> > S = .5 +  ((P - Q) / (P + Q)) / 2
> >
> > I apologize for any confusion as to why the Gary Robinson version I use
> > doesn't do so well; it's because I haven't added any of the "Further
> > Improvements", which apparently help a lot.
> 
> To use Gary's scheme it's best to purge all the artificial biases from
> Paul's scheme.  For example, min and max probs of 0.01 and 0.99 must go,
> artificially multiplying ham counts by 2 must go, ignoring words until
> 2*hamcount + spamcount >= 5 must go, and just over the last day we're
> getting evidence that it's *also* good in Gary's scheme to remove the
> asymmetry in Paul's scheme wrt how many times to count a word that appears
> multiple times in a msg (Paul counts such a beast only once when scoring,
> but multiple times when training; in effect, it's Yet Another Bias that, in
> his full scheme, appears to partly cancel out the "let's ignore words that
> don't appear often" bias).

I'll have a go at the removal of the biases.
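
To make sure I remove the right things, here's a minimal sketch of the
two combining rules as written above, in plain Python.  The probs
argument is the list of per-word spam probabilities, assumed to come
from whatever word database is already in place, with none of Paul's
clamping or count biases applied and with every p strictly between 0
and 1:

    def graham_combine(probs):
        # P, Q, S exactly as in the first set of formulas above.
        P = Q = 1.0
        for p in probs:
            P *= 1.0 - p
            Q *= p
        return Q / (P + Q)

    def robinson_combine(probs):
        # Same products, but take the n-th root of each and recenter
        # the score around 0.5, as in the second set of formulas.
        n = len(probs)
        P = Q = 1.0
        for p in probs:
            P *= 1.0 - p
            Q *= p
        P = 1.0 - P ** (1.0 / n)
        Q = 1.0 - Q ** (1.0 / n)
        return 0.5 + ((P - Q) / (P + Q)) / 2.0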

> > I've been contemplating borrowing some classification code from
> > classifier.py, but am shying away because I like to keep things simple,
> 
> ??? Gary's scheme is simpler than Paul's -- it doesn't have a pile of
> "mystery knobs" to trip over.  The classifier is quite simple; tokenizer.py
> is getting hairy, though.

I've been having very good results with just translating all
non-alphanumerics into whitespace and uppercase into lowercase, then
using split().  It has the benefit of (I think) running entirely in C
for the translate and split calls.

I'm also currently only tokenizing words with 3 <= wlen <= 10; that
way I don't need to strip 'SUSPECTEDSPAM' from the subject header,
which I use as a filterable keyword.
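
In case anyone wants to reproduce that, here's roughly what the
tokenizer amounts to (a minimal sketch only, assuming plain ASCII text
and that the message body is already a string):

    # Map every ASCII character to lowercase if it's alphanumeric,
    # otherwise to a space; anything above 127 is left alone here.
    _table = {i: (chr(i).lower() if chr(i).isalnum() else ' ')
              for i in range(128)}

    def tokenize(text):
        # translate() and split() are both C-level string methods.
        words = text.translate(_table).split()
        # The length filter also drops SUSPECTEDSPAM (13 chars).
        return [w for w in words if 3 <= len(w) <= 10]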

> > mostly because I like to understand what is going on with the math I
> > use.
> 
> This gets stranger and stranger <wink>.

Yes, the desire to understand the software one is working on is quite
strange.  I rather like to understand ALL of it, not just pieces of
it...but that only applies to my own code.  If I can understand someone
else's at all, I'm happy.

> > Of course that makes comparison between algorithms used and fn/fp
> > performance somewhat difficult, but I believe that at least some
> > useful information can be gained.
> 
> This would be really interesting with a best-effort implementation of both
> schemes.  If you follow the suggestion at the top to read the comments,
> they'll tell you how to get our last implementation of Paul's scheme; its
> performance was very good (and much better than Paul's scheme before
> fiddling it).

I'm checking it out.


> Note that there's a large literature on how best to combine multiple
> schemes.  AdaBoost has been mentioned before, but nobody here has pursued it
> yet (it would be interesting in at least two ways:  combining multiple
> "whole cloth" schemes, and viewing each spamprob value as an independent
> mini-classifier in its own right).

If I have the time, I'll look into it.

> For reasons explained in TESTING.txt, results obtained from a single test
> set don't carry any weight here, beyond suggesting that it may be worthwhile
> to run a rigorous multi-corpora test.  In the universe of all emails, your
> test data is a very small sample, and so you can never be sure whether
> you're observing improvements in your algorithm, or just observing random
> quirks in your data.  You could get much more confidence in your results,
> even with this small amount of data, by adapting the cross-validation
> testing framework used in mboxtest.py and timcv.py.  That slices-and-dices
> your data N ways, training on (N-1)/N of it and predicting the remaining 1/N
> of it, N times.  Sometimes after making a change, you'll find that half the
> runs win and the other half lose -- in such a case, it's hard to sell that
> the change made a real difference in either direction.  But also in such a
> case, had you run only one test, you would have had a 1 in 2 chance of
> erroneously concluding that the change helped -- or that the change hurt.
> We're taking great pains to avoid such traps here.

You make a good point (which is also in the TESTING.txt file *grin*). 
I'll remember that when I report my 'results'.
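
For my own notes, the slicing-and-dicing described above boils down to
something like the sketch below.  It is not the actual mboxtest.py /
timcv.py driver; train and score are hypothetical stand-ins for
whatever classifier interface is under test (train takes a list of
(message, label) pairs and returns a classifier, score returns a
predicted label):

    def cross_validate(messages, labels, train, score, n=10):
        # Deal the data into n buckets, then n times: train on the
        # other n-1 buckets and predict the held-out bucket, so every
        # message is scored by a classifier that never trained on it.
        buckets = [[] for _ in range(n)]
        for i, pair in enumerate(zip(messages, labels)):
            buckets[i % n].append(pair)

        accuracies = []
        for held_out in range(n):
            training = [pair for j, bucket in enumerate(buckets)
                        if j != held_out for pair in bucket]
            clf = train(training)                # hypothetical trainer
            hits = sum(score(clf, msg) == label  # hypothetical scorer
                       for msg, label in buckets[held_out])
            accuracies.append(hits / len(buckets[held_out]))
        return accuracies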

I really should be getting on with other errands...I've spent at least
5 hours on this today *gasp*.

 - Josiah