[Spambayes] There Can Be Only One

Tim Peters tim.one@comcast.net
Tue, 24 Sep 2002 20:26:09 -0400


I'm detecting subtle <snort> signs that people mistake the chaos on this
list for chaos in the project.  It is getting out of control, but it's not
there quite yet.

The idea here was always to take competing core ideas, test them against
each other, and kill the loser.  Unfortunately, I had more ideas than I
could possibly make time to test, and Gary generates more ideas than I can
possibly make time to test too.  As a result, we've got 4 core scoring
schemes now, and more options to control them than even I can keep track of.

So, it's time to kill one!  I need your help.  I have no interest in
absolute error rates here, just in which of two specific schemes does better
on your data.  If you have error rates of 30%, and believe your ham and spam
are clean, fine, your input is important (not mentioning Skip by name ...);
indeed, it's *more* important than ever-finer hair-splitting from people
with error rates under 1%.

To avoid distractions, there's only one kind of test run I'll look at for
this:  a 10-fold cross-validation run with exactly 200 ham and 200 spam in
each set.  If you don't have at least 2,000 ham and 2,000 spam, you can't
run this test, and reporting results won't help.  If you have more than 200
in each set, that's fine; run timcv like so:

    timcv -n10 --ham=200 --spam=200 -s12345

"12345" is a magic number, and you can pick any you like, but you have to
use the same int each time.  It's folded into a randomization scheme that
picks a reproducible set of 200 messages "at random" from each of your Set
directories.  For those running mboxtest.py, I believe it's capable of doing
something similar already; Jeremy can say for sure.
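
For anyone wondering what "reproducible at random" can possibly mean:  it's
just seeded pseudo-randomness.  Here's a toy sketch of the idea -- the
pick_subset function is made up for illustration, and the real driver code
surely differs in detail:

    import random

    def pick_subset(paths, n, seed):
        # Seed a PRNG with the "magic int" so the same n messages come
        # back every time you rerun with the same -s value.
        r = random.Random(seed)
        paths = sorted(paths)   # start from a fixed order
        r.shuffle(paths)        # a reproducible "random" shuffle
        return paths[:n]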

Don't add new messages to your Sets while running these tests.  Indeed,
don't change anything apart from what's detailed below.  If you find ham in
your spam, or vice versa, do clean it out, and then start over again.

The two combatants in this test are G0 and G1:  our last stab at Graham's
scheme, and Gary's f(w) scheme.  The losing code will be purged from the
codebase and the options controlling it will disappear.  In case of a tie,
Gary's scheme wins, because the code is cleaner and it's much easier to
reason about what it's doing.


How To Run the Graham Scheme
============================
This may be a surprise to you <wink>:  Graham's scheme is still there, and
hasn't changed at all lately.  The way to get it is simply to leave your

[Classifier]

section entirely empty, and force

[TestDriver]
spam_cutoff: 0.90

That's also the default, so if you don't use a customization file at all,
you get the Graham scheme.

What else you put in the [Tokenizer] and [TestDriver] sections doesn't
matter for the purposes of this test -- just pick something, and then leave
it alone.  Whatever evidence you're generating, we just want to see which
scheme does better with it.


How To Run The f(w) Scheme
==========================
This is harder, because we've spent almost no time systematically tuning the
parameters that matter to this scheme.  Part of this test is for you to
figure out, and report back on, which parameter values work best on your
data.  Start like so:

"""
[Classifier]
use_robinson_combining: True
use_robinson_probability: True
robinson_probability_x: 0.5
robinson_probability_a: 0.5
max_discriminators: 150
robinson_minimum_prob_strength: 0.1

[TestDriver]
spam_cutoff: 0.550
"""

You'll never change use_robinson_combining or use_robinson_probability in
this part of the test.

robinson_probability_x is related to Graham's UNKNOWN_WORD_SPAMPROB, and
I've seen no evidence to suggest that moving it away from 1/2 is going to do
anyone any good.  So if you never get around to fiddling this one, that's
fine by me.  If you can make time to fiddle it, great.

robinson_probability_a is important, and no good tests have been reported on
it.  Potentially interesting values range from 0.0 to infinity.  Try to find
the optimal value for your data (I suspect it will be on the low end of
[0, 1], but I don't know; exactly 0.0 may well be a disaster).  The lower it
is, the more extreme the spamprobs given to words that appear in few of your
training msgs.
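
Roughly, x and a fit together like so.  This is a back-of-the-envelope
sketch of the adjustment -- adjusted_spamprob is made up for illustration,
not a paste from the codebase -- so trust the shape, not the details:

    def adjusted_spamprob(raw_prob, nmsgs, x=0.5, a=0.5):
        # raw_prob: the word's spamprob computed from raw counts
        # nmsgs:    number of training msgs the word appeared in
        # x:        robinson_probability_x (the "unknown word" guess)
        # a:        robinson_probability_a (how hard to pull toward x)
        #
        # A small a lets a rarely seen word keep an extreme raw_prob;
        # a large a drags it toward x until more evidence shows up.
        return (a * x + nmsgs * raw_prob) / (a + nmsgs)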

max_discriminators controls how many "extreme words" the scheme scores.  I
saw no real difference between 150 and 1500 in one brief test, but then I
saw no real difference between 150 and 15 in another brief test.  We need
more results on this.  Again try to find an optimal value for your data.

When robinson_minimum_prob_strength is greater than 0, "bland" words are
ignored.  Calling the value p, all words with spamprob in 0.5-p to 0.5+p
(exclusive of the endpoints) are ignored.  At 0.0, no words are ignored;  at
0.5, all words are ignored.  I conjecture that the best value is below 0.5
<wink>.  The suggested starting point of 0.1 ignores only the really bland
words, and has been a winner according to all people reporting on it so far.
Try to find the *best* value for you.
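
Between them, robinson_minimum_prob_strength and max_discriminators just
decide which words get scored at all.  Another made-up sketch of the idea
(pick_clues isn't real code either):

    def pick_clues(spamprobs, max_discriminators=150, min_strength=0.1):
        # spamprobs: one (adjusted) spamprob per distinct word in the msg.
        # First drop the "bland" words too close to 0.5 ...
        strong = [p for p in spamprobs if abs(p - 0.5) >= min_strength]
        # ... then keep only the most extreme max_discriminators of them.
        strong.sort(key=lambda p: abs(p - 0.5), reverse=True)
        return strong[:max_discriminators]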

Finally, and most irritatingly, spam_cutoff is crucial in this scheme --
changing its value by 0.02 can make a run that looks like a loser look like
a winner, or vice versa.  You can stare at your score histograms to figure
out whether boosting this or dropping it would help a run do better.  We
simply don't know anything useful about this setting yet.
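
If you can get at the per-message scores from a run, sweeping candidate
cutoffs after the fact is a cheap way to eyeball this.  A throwaway sketch
(sweep_cutoffs isn't part of any driver -- how you collect the scores from
your own run is your problem <wink>):

    def sweep_cutoffs(ham_scores, spam_scores):
        # ham_scores, spam_scores: per-message scores in [0.0, 1.0].
        # Print the false positive rate (ham scored as spam) and the
        # false negative rate (spam scored as ham) at each cutoff.
        for i in range(40, 100):        # cutoffs 0.40 through 0.99
            cutoff = i / 100.0
            fp = sum(1 for s in ham_scores if s >= cutoff)
            fn = sum(1 for s in spam_scores if s < cutoff)
            print("cutoff %.2f  fp %5.2f%%  fn %5.2f%%" % (
                cutoff,
                100.0 * fp / len(ham_scores),
                100.0 * fn / len(spam_scores)))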

Caution #1:  These parameters almost certainly interact with each other.
Tuning all at one time is difficult.  (The Graham scheme has more
interacting parameters, btw; you weren't nearly as aware of that because I
stayed up for a week <0.7 wink> tuning them while you slept -- I can't do
that anymore.)

Caution #2:  If your error rates are already low, it's quite possible that
this test isn't big enough to reveal anything.  But that's interesting too,
in its own charmingly useless way.


Extra Credit
============
After you're done, do it all over again with a different magic int.  Do the
relative results change?  If so, do you need significantly different
parameter settings for best results?
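
For example, if you used -s12345 the first time, any other int will do:

    timcv -n10 --ham=200 --spam=200 -s54321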


Extra Extra Credit
==================
Repeat the above until you drop from boredom.


FAQ
===
Q. This is a ham/spam ratio of 1.  Is that realistic?
A. We can't test everything at once.

Q. 1800 each of ham & spam is a very small training set.
   Wouldn't it be better to use more training data?
A. We can't test everything at once.

Q. 1800 each of ham & spam is a very large training set.
   Wouldn't it be better to use less training data?
A. We can't test everything at once.

Q. I thought Gary's central-limit ideas were more interesting.
   Why don't we test them instead?
A. We can't test everything at once -- and the little we know
   about tuning f(w) is already much more than we know about
   tuning the c-l schemes.

Q. It occurs to me that our tokenization scheme may benefit
   from leaving punctuation attached to words *mostly* because
   Graham's scheme only looks at a handful of words, and then,
   e.g., counting "Python" and "Python." and "Python," etc as
   distinct words really helps.  But the f(w) scheme can look
   at any number of words, so maybe a different tokenization
   scheme would do better there.
A. You apparently figured out what I was really thinking about
   while tediously repeating "We can't test everything at once."
   Congratulations!  It's irrelevant to this test, though --
   bring it up later.

Q. Each of the 10 cross-validation runs predicts against only
   200 ham and 200 spam.  That's not very much!  In fact, the
   smallest possible change in an error rate will be 0.5% (1
   message).  That isn't a very sensitive test.  In addition,
A. Let me know when you're ready to stop whining and run the
   tests <wink>.

shake-hands-and-come-out-fighting-ly y'rs  - tim