[spambayes-dev] testing tweaks

Tim Peters tim.one at comcast.net
Thu Aug 7 19:48:11 EDT 2003


[Justin Mason]
> Have you guys considered testing how a tweak affects DB size -- i.e.
> including that in the test results output?   I find that's a pretty
> major factor in a lot of cases in SpamAssassin.  </delurk>

I paid a lot of attention to that in the early days, since I was running
training sets with tens of thousands of messages, and used an entirely
in-memory Python dict to hold all the stats.

Most gimmicks didn't make a difference worth noting.  There is one hack in
our tokenizer to reduce database size:  tokens exceeding 12 characters are
replaced by a synthesized token recording only the first character and
floor(len(token)/10)*10 (the length rounded down to a multiple of 10).
Testing showed that recording "long tokens" in full didn't make any
difference to results, but bloated the database with many fat hapaxes.

In effect, then, no matter what the other tokenization gimmicks do, we
don't create tokens with more than 12 characters, and we create a number
of tokens approximately equal to the number of non-whitespace runs in the
message.
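
In rough Python, the idea is something like this (a sketch of the effect,
not the actual tokenizer code; the "skip:" marker format is just
illustrative):

    def crunch_token(token, max_len=12):
        """Replace overly long tokens with a short synthesized marker.

        Long tokens add little beyond database bloat (fat hapaxes), so
        record only the first character and the length rounded down to a
        multiple of 10.
        """
        if len(token) <= max_len:
            return token
        return "skip:%c %d" % (token[0], len(token) // 10 * 10)

    def tokenize(text):
        # One token per non-whitespace run, long runs crunched.
        return [crunch_token(t) for t in text.split()]

    print(tokenize("click aVeryLongTrackingIdentifier12345 now"))
    # ['click', 'skip:a 30', 'now']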

The option replace_nonascii_chars is also very effective at reducing
database size (it replaces each high-bit and control byte with a question
mark), and it actually helps English-speaking users nail Asian spam.  It
would presumably also murder Asian ham, but that's not a problem I have
<wink>.  That option is off by default in the codebase, but on by default
in the Outlook add-in.
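
The effect is essentially this (a sketch, not the actual implementation;
treating ordinary whitespace as exempt is my assumption):

    import re

    # Printable ASCII and ordinary whitespace pass through; every other
    # byte (high-bit or control) becomes a '?'.
    _high_or_control = re.compile(r"[^\x20-\x7e\t\r\n]")

    def replace_nonascii_chars(text):
        return _high_or_control.sub("?", text)

    print(replace_nonascii_chars("cheap vi\xe4gra"))  # 'cheap vi?gra'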

Other gimmicks we don't use had huge effects on database size.  Character
5-grams were murder on database size.  They also performed worse, so
dropping them was no pain.  Schemes also looking at token pairs (bigrams)
more than doubled the database size.
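
A toy illustration of why (not project code):  even a short phrase yields
far more character 5-grams than words, and bigrams nearly double the number
of entities per message.

    def word_bigrams(words):
        # Every adjacent word pair becomes a token, on top of the unigrams.
        return [" ".join(pair) for pair in zip(words, words[1:])]

    def char_5grams(text):
        # Every 5-character window becomes a token.
        return [text[i:i + 5] for i in range(len(text) - 4)]

    words = "get your cheap meds online now".split()
    print(len(words))                         # 6 unigrams
    print(len(word_bigrams(words)))           # 5 additional bigram tokens
    print(len(char_5grams(" ".join(words))))  # 26 character 5-grams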

If I ever get time for it, I'd like to pursue a specific mixed
unigram-bigram scheme worked out with Gary Robinson.  For example, given
"penis size", that can be viewed as a bigram, or as two unigrams, or as two
unigrams *and* a bigram.  The last choice isn't so good because it
systematically creates highly correlated clues, which leads to mistakes that
don't make sense to a human eye (I'll claim that experienced spambayes users
are sympathetic to the mistakes it makes -- spambayes judgments are
"intuitive", in some real sense).  But with enough effort, it's possible to
"tile" a message with non-overlapping unigrams and bigrams, so that each
token contributes to exactly one scored entity.  The trick is to do this in
a way that maximizes the overall strength of the entities that get scored.
So, for example, and simplifying too much, if the bigram "penis size" has a
spamprob closer to 0.0 or 1.0 than either of the unigrams "penis" and
"size", view it as a bigram; but if "penis" has a spamprob closer to 0.0 or
1.0 than "penis size", view it as two unigrams instead.
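
Here's a sketch of how such a tiling could be computed.  This is just an
illustration of the idea, not the code I actually tested; measuring a
clue's strength as abs(spamprob - 0.5) is the obvious stand-in for
"closer to 0.0 or 1.0".

    def tile(words, spamprob):
        """Split words into non-overlapping unigrams and bigrams so that
        the summed strength of the chosen entities is maximized.

        strength(p) = |p - 0.5|: a spamprob near 0.0 or 1.0 is a strong
        clue; one near 0.5 is nearly worthless.
        """
        def strength(token):
            return abs(spamprob(token) - 0.5)

        n = len(words)
        # best[i] = (best total strength for words[i:], tiling of words[i:])
        best = [None] * (n + 1)
        best[n] = (0.0, [])
        for i in range(n - 1, -1, -1):
            # Option 1: score words[i] as a unigram.
            s1, tiling1 = best[i + 1]
            choices = [(strength(words[i]) + s1, [words[i]] + tiling1)]
            # Option 2: score words[i] and words[i+1] together as a bigram.
            if i + 1 < n:
                s2, tiling2 = best[i + 2]
                bigram = words[i] + " " + words[i + 1]
                choices.append((strength(bigram) + s2, [bigram] + tiling2))
            best[i] = max(choices, key=lambda c: c[0])
        return best[0][1]

    # Toy spamprobs: the bigram is a stronger clue than either unigram,
    # so it wins; "matters" is left as a (worthless) unigram.
    probs = {"penis": 0.80, "size": 0.55, "penis size": 0.99}
    print(tile("penis size matters".split(), lambda t: probs.get(t, 0.5)))
    # ['penis size', 'matters']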

I only had time to run a few tests on that, and it looked very promising,
learning faster than our current pure-unigram scheme, and doing at least as
well on all error measures.  It was (of course) slower to score, and the
database more than doubled in size.  For my own use, it would have been
worth it, since my personal databases are still relatively tiny (about 1,000
training msgs total), and the code runs too fast for me to notice it now.  I
suspect, but don't know, that this mixed scheme would do significantly
better on short messages.



