[Spambayes] Another optimization

Tim Peters tim.one@comcast.net
Wed, 18 Sep 2002 14:08:14 -0400


[T. Alexander Popiel]
> Some weeks ago I independently implemented Graham's filter
> algorithm, based on his essay on the topic.  Since my
> implementation is in perl (please, no flames), I doubt
> my code will be all that helpful... and anyway, I just
> sort of hacked it together.  If you still want to see it
> after that disclaimer, it's available at
> http://wolfskeep.com/mailfilter/ .

Thanks!  I don't hate Perl, although sometimes Perl has hated me <wink>.

> About a week and a half ago, I got a reference to your
> collective work here; I scanned through your code gleaning
> ideas.  I figured I should return the favor. ;-)
>
> There are a few things I was wondering about, though:
>
> * Graham was very specific in describing his tokenizer...
>   and you folks seem to have ignored that description.
>   Instead, you're using split-on-whitespace augmented by
>   a few handcrafted hacks for URLs, addresses, and the like.
>   This puzzles me, since I seem to get better results using
>   the tokenization that Graham suggested.

The answer to every question is "because careful experiments said it worked
better".  This one is discussed in a long tokenizer.py comment block, with
title "How to tokenize?", and some full both-way test results are given
there.

> * Why do you throw out super-long words?

Likewise, but under the section "How big should 'a word' be?".

>   Yes, they potentially increase the database size, but I find that
>   lines of dashes of various lengths to be particularly
>   good spam and non-spam indicators.

We get some of that via the compressed 'skip' tokens we generate for
"super-long" words.

>   Also, some of the worms seem to be caught by recognizing their
>   uuencoded forms.

We throw out uuencoded sections entirely, except to specially tag the
permission mask and pieces of the filename.

>   Since I'm not storing _all_ words encountered (I'm only keeping the
>   probability computations after ignoring the few-occurences words)

(Perhaps not so) Strangely enough, my experiments showed that never ignoring
a word worked better.

>   in the database, keeping the long words doesn't seem to use much space.
>   And disk space is cheap, right?

Heh -- ask Neale here about that <wink>.

Do read our project's TESTING.txt:  I never argue with what the data tells
me, and "good arguments" don't count for beans compared to that -- an idea
genuinely reduces an error rate, or it doesn't, and only the ideas that do
survive.  But as TESTING.txt also warns, ideas aren't independent, and the
evaluation of a specific idea has to be done in the context all other ideas
implemented (for example, you very likely get some extra benefit out of
"super long" words *because* you don't toss uuencoded sections).  In theory,
every previous decision should be revisited from scratch whenever anything
changes.  I'm not married to anything here, but I simply will not make a
change unless testing says it's a winner.

> * One key feature of Graham's algorithm is that it only
>   considers the N most significant words in a message;
>   however, in your cancellation of .01 and .99 words, you
>   seem to throw away the the most significant words in
>   favor of less significant words.  (Yes, I understand
>   that the math for cancellation works out assuming the
>   same set of less significant words is considered...
>   but by throwing out the matching words, I think your
>   code then considers _more_ of the less significant words.)

We cancel out *equal numbers* of 0.01 and 0.99 words, and that's all.
Including them has no effect whatsoever on the computed result, *except*
that including them prevented other words from being considered in Graham's
original scheme.  It was a measurable improvement to do what we do here, but
IMO the scheme remains a crock when dozens of .01 and .99 clues both appear,

>   In my code, instead of cancelling out the .01 and .99
>   words, I inflate the considered words count to include
>   all of these 'maximally significant' words that appear
>   in the message, even if that number is >N.

If you threw out equal numbers of cancelling clues then, it would make no
difference to the result you compute in the end.  Each .99 & .01 pair
contributes a factor of .99*.01 to Graham's numerator, and also to each term
in Graham's denominator:  they cancel out exactly.

>   However, I do _not_ consider additional less significant words if
>   the number of maximally significant words is >N.

And we do, *if* the number of non-cancelling maxprob words is less than N.
The fact is that the scheme is almost totally lost in these cases
regardless, and Graham's combining rule is such that just a few .01`or .99
clues not cancelled pretty much determines the final outcome.  One of my
most stubborn false negatives has over a hundred .01 clues to a couple dozen
.99 clues -- it doesn't matter what we do to combine those, it's always
going to be a false negative without a more fundamental change elsewhere.
This "only look at smoking guns" scoring scheme seems to be systematically
weak when dealing with long and chatty "just folks" spam (of which there is
blessedly little so far <wink>).

>   I have found this to be _much_ more effective; it lowered the
>   FN rate while leaving the FP rate constant, whereas
>   cancelling and considering more of the less significant
>   words increased both FN and FP for me.

I don't expect I'll be able to figure out *exactly* what you do here without
studying your code.  If someone wants to run an experiment on that, cool,
the data will tell us.  At the moment, though, the error rates on my 30,000+
message tests have such small absolute value that I can no longer measure an
improvement when one is made; I can only detect regressions reliably now.

>   If I managed to misread your code, I apologize for the
>   mischaracterization.
>
> My testing is not as rigorous as yours is; in particular,
> I don't retrain using several different sets of training
> data in the course of evaluating my changes.  As a result,
> my experience may be biased. :-/

Well, it's a statistical approach, and on a single run there are non-zero
chances of a bad idea appearing to win, and of a good idea appearing to
lose.  The odds of making such a mistake decreases as the amount of data
used increases, but the chances of drawing wrong conclusions remain non-zero
regardless.  Running multiple tests over multiple data sets is another way
to sanity-check ideas.  The examples of test runs in tokenizer.py show
several cases where, e.g., a specific idea *appeared* to be a win in 3 runs,
but was actually a loss in 17 others.  More common is that an idea wins in 8
runs, loses in 8, and ties on 2.  You really need multiple test runs to have
confidence you're making progress.