[Spambayes] How low can you go?

Tim Peters tim.one at comcast.net
Tue Dec 16 21:40:53 EST 2003


[Tim, on the spambayes list, about x-use_bigrams in CVS]
> I see that it's a cruder approximation to the suggested scoring
> algorithm (which I implemented at one time).  For example ...

I checked in the intended implementation.  Here's the checkin comment:

    Implemented the intended "tiling" version of x-use_bigrams.  Tried
    to restore most of the speed lost when this option *isn't* in use.

    Will add comments later.

    Anyone using x-use_bigrams needs to retrain:  synthesized bigrams
    now begin with a "bi:" prefix.

Skip, that last point addresses your (good!) concern about ambiguity wrt the
special 'saved state' key.
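
For concreteness, here's a toy sketch of the idea (mine, for
illustration only -- not the checked-in code):  synthesized bigrams
carry the new "bi:" prefix, and "tiling" means clues are picked
strongest-first so that no word position is covered by more than one
clue -- a bigram and its constituent unigrams can't both count:

    def candidate_tokens(words):
        # Unigrams cover one word position; synthesized bigrams
        # (note the "bi:" prefix) cover two.
        for i, w in enumerate(words):
            yield w, (i,)
        for i in range(len(words) - 1):
            yield "bi:%s %s" % (words[i], words[i + 1]), (i, i + 1)

    def tile(scored_clues, max_clues=150):
        # scored_clues: (strength, token, positions) triples, where
        # strength is the spamprob's distance from 0.5.  Take clues
        # strongest-first, skipping any whose word positions overlap
        # a position already claimed by a stronger clue.
        used = set()
        picked = []
        for strength, token, positions in sorted(scored_clues,
                                                 reverse=True):
            if used.isdisjoint(positions):
                used.update(positions)
                picked.append(token)
                if len(picked) >= max_clues:
                    break
        return picked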

Here's what I've found so far.  My main personal database is currently
trained on 474 ham and 489 spam, using mostly mistake-and-unsure-based
training, with a spam cutoff of 95 and a ham cutoff of 4 (yup, those are
extreme -- I've been experimenting).
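
For anyone unfamiliar with the cutoffs:  scores below the ham cutoff
are filed as ham, scores at or above the spam cutoff as spam, and
everything between is Unsure.  A trivial sketch (exact boundary
conventions assumed):

    HAM_CUTOFF, SPAM_CUTOFF = 4, 95   # my extreme settings

    def bucket(score):
        # score is a percentage in [0, 100]
        if score < HAM_CUTOFF:
            return "ham"
        if score >= SPAM_CUTOFF:
            return "spam"
        return "unsure"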

Database size (a bsddb3 hash database):

    without x-use_bigrams   2,544KB
    with x-use_bigrams     10,288KB

That's a major size boost, and (of course) is expected (bigrams create fat
hapaxes at a prodigious rate).
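
To see why:  an N-word message yields N unigrams but also N-1
synthesized bigrams, and bigrams repeat across messages far less
often, so nearly all of them are hapaxes (tokens seen exactly once).
A toy way to measure that (a hypothetical helper, not part of
SpamBayes):

    from collections import Counter

    def hapax_counts(messages):
        # messages: lists of words.  Returns (# unigram hapaxes,
        # # bigram hapaxes) -- tokens appearing exactly once overall.
        uni, bi = Counter(), Counter()
        for words in messages:
            uni.update(words)
            bi.update("bi:%s %s" % pair
                      for pair in zip(words, words[1:]))
        one = lambda c: sum(1 for n in c.values() if n == 1)
        return one(uni), one(bi)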

There's no reason to suppose that the selection of training ham and spam
based on mistake-and-unsure training from a unigram-only classifier makes
much sense for a mixed uni+bi-gram classifier; to the contrary, the latter
almost certainly has different strengths and weaknesses.

An example is the highest-scoring ham in my inbox.  Because I had
previously put copies of some of those into my ham training data (back
when my ham cutoff was 20), no message in my inbox today scores above
20 without x-use_bigrams.  These are the worst:

     6  6  6  7  7  7  7  8  8  8  9  9  9 12 13 13 14 16

After retraining on the same training sets with x-use_bigrams, then
rescoring my inbox, the highest-scoring ham in my inbox are worse:

     7  8  8  9 10 12 13 13 13 13 16 22 25 31 34 38 45 49

I'm confident that this is an artifact of using training sets based on
picking on the weakest performance of a different scoring strategy, and that
had I been using train-on-everything all along, that result would have been
very different.

There's an interesting example in the other direction too:  the last time I
started over from scratch, I left one Unsure in my Unsure folder, and have
kept it there ever since.  It's a long and chatty spam, about a topic I even
have some interest in (no, my wang already has carpet burns <wink>), and I
wanted to see how mistake-based training changed its score over time.  It
drifted slowly upward all along, from the low 40s to the low 80s.  Under
x-use_bigrams, though, the score zoomed to 95.34.

The difference comes from high-scoring bigrams that appeared in a few
other spam (the columns are token, spamprob, ham count, spam count):

'bi:any questions,'                 0.908163            0      2
'bi:website at:'                    0.908163            0      2
'bi:visit our'                      0.931987            1     17
'bi:create your'                    0.934783            0      3
'bi:than years'                     0.934783            0      3
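
Those spamprobs follow from the Robinson adjustment applied to those
counts:  with the default strength s=0.45 and prior x=0.5, plus my 474
ham and 489 spam, they come out exactly.  A quick sketch to check:

    def spamprob(hamcount, spamcount, nham=474, nspam=489,
                 s=0.45, x=0.5):
        # Robinson adjustment:  shrink the raw spam ratio toward the
        # prior x, with strength s.
        hamratio = hamcount / float(nham)
        spamratio = spamcount / float(nspam)
        prob = spamratio / (hamratio + spamratio)
        n = hamcount + spamcount
        return (s * x + n * prob) / (s + n)

    # spamprob(0, 2)  -> 0.908163...   ('bi:any questions,')
    # spamprob(1, 17) -> 0.931987...   ('bi:visit our')
    # spamprob(0, 3)  -> 0.934783...   ('bi:create your')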

"than years" is a peculiar one, eh?!  Then original text was

    ... more than 30 years ago ...

and we skipped "30" because it's shorter than 3 characters.
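
In sketch form (a much-simplified stand-in for the real tokenizer,
which does a lot more):

    def bigrams_after_filter(text, minlen=3):
        # Words shorter than minlen are dropped before pairing, so
        # "30" vanishes and "than"/"years" become adjacent.
        words = [w for w in text.split() if len(w) >= minlen]
        return ["bi:%s %s" % pair for pair in zip(words, words[1:])]

    # bigrams_after_filter("more than 30 years ago")
    # -> ['bi:more than', 'bi:than years', 'bi:years ago']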

So, conclusions for now:

+ x-use_bigrams is going to bloat your database bigtime.

+ If you use train-on-everything, and want to try it, no problem.

+ If you're doing mistake-based training and want to try it, probably
  best to start over from scratch.

+ I believe that mistake-based training under this method is likely
  to be substantially more brittle than mistake-based training under
  the (still default) unigram-only scheme, because it's even more
  hapax-driven (synthesizing bigrams creates many more hapaxes).

+ OTOH, bigrams are better at recognizing the language of advertising.
  For example, "bi:website at:" is more clearly a "call to action" than
  either "website" or "at:".



