[Spambayes] How low can you go?

Tim Peters tim.one at comcast.net
Sat Dec 13 18:02:22 EST 2003

[Skip Montanaro]
> Okay, time for a little contest.  We've recently seen several users
> tout the size of their training database.  I used to be one of those
> "enlarged database" types, but no more.


> At the moment I have trained on 14 spams and 20 hams and am quite
> pleased with how its performing so far.


> So, how small is yours? <wink>

If size is measured by total # of messages trained on, a different approach
could probably work better.  The original goal of this project was filtering
high-volume mailing lists, on high-end server-class machines, and there was
really no limit on how large a training set was acceptable, nor any
difficulty in getting any number of ham and spam to train on.

Schemes that consider bigrams too (or even larger units, like Bill's CRM114)
learn faster, meaning that my tests showed they achieved accuracy comparable
to our current scheme after training on fewer total messages (although in
the versions of CRM114-like tokenizing we tried, performance plummeted after
hash collisions became "too common").  Given the way we store token
statistics, though, they can create very much larger databases (so large
that even a high-end box would struggle to keep up).

Since the size of database required under our current scheme grows mildly
(in comparison) as the # of training messages increase, and experiments
showed that it did at least as well as any alternative tried given enough
training data, there was no good reason to switch.

But if you're actively looking to minimize the # of messages trained on, the
higher database growth rate of other schemes isn't as important.

One scheme I recommend trying is the mixed unigram/bigram scheme, with the
special scoring gimmick Gary Robinson suggested.  I *think* Tony Meyer has
an up-date-patch for that, but I may be wrong.

A difficulty with enlarging "the source window" we look at when generating
tokens is that it creates highly correlated features.  For example, the

    penis enlargement

generates "penis" and "enlargement", but under the mixed unigram/bigram
scheme it *additionally* creates the single feature "penis enlargement".
All of those are likely to have high spamprobs, of course.

Correlation usually helps us rather than hurts us, but there are exceptions,
and the truth of it relies in large part on that our current scheme rarely
goes out of its way to create correlated features.  Mixing unigrams and
bigrams does go out of its way, and, indeed, creates nothing but correlated
features.  Correlated features have been implicated time and again in the
rare "spectacular failures" we see, so it's scary to create more of them.

That's where Gary Robinson's "special scoring gimmick" comes in, a way to
*count* no more than one feature per source token when scoring.  In the
example, it might decide to score "penis enlargement" as a single feature,
but, if it did, it would *not* also feed the spamprobs of "penis" and
"enlargement" into the final score; or it might decide to feed the spamprobs
of both constituent words into the final score, in which case it would leave
the spamprob of the bigram out of the score.  In effect, scoring "tiles" the
source with a collection of non-overlapping unigram and bigram features,
picked in such a way as to approximate maximizing the aggregate spamprob
strengths over all possible tilings.

That wasn't tested enough to ensure it achieved what it was after, but it
made a lot of theoretical sense, and worked fine in small preliminary tests.
The point is to get faster learning without increasing the "spectacular
failure" rate (which has always been very small, but isn't 0, and would most
likely get much larger (but still remain "small"!) without a gimmick to
counteract systematic correlation).

More information about the Spambayes mailing list