[Spambayes] How low can you go?

Wed Dec 17 11:10:55 EST 2003

    Tim> Database size (a bsddb3 hash database):

    Tim>     without x-use_bigrams   2,544KB
    Tim>     with x-use_bigrams     10,288KB

    Tim> That's a major size boost, and (of course) is expected (bigrams
    Tim> create fat hapaxes at a prodigious rate).
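
A rough way to see the hapax effect is to count tokens that occur in only
one message, with and without synthesized bigrams.  This is only a sketch:
the whitespace tokenizer, the "bi:" prefix, and the helper names are
assumptions made for illustration, not the actual x-use_bigrams code.

```python
from collections import Counter

def unigrams(text):
    # Naive whitespace tokenizer (an assumption, not the Spambayes tokenizer).
    return text.lower().split()

def with_bigrams(tokens):
    # Hypothetical scheme: keep every unigram and add each adjacent pair.
    pairs = [f"bi:{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + pairs

def count_hapaxes(messages, tokenizer):
    # Count each token at most once per message, then count the tokens
    # that were seen in exactly one message (the hapaxes).
    counts = Counter()
    for msg in messages:
        counts.update(set(tokenizer(msg)))
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return len(counts), hapaxes

messages = [
    "buy cheap pills now",
    "cheap pills shipped now",
    "meeting notes for now",
]

uni_total, uni_hapax = count_hapaxes(messages, unigrams)
bi_total, bi_hapax = count_hapaxes(
    messages, lambda m: with_bigrams(unigrams(m)))

print("unigrams only:  %d tokens, %d hapaxes" % (uni_total, uni_hapax))
print("with bigrams:   %d tokens, %d hapaxes" % (bi_total, bi_hapax))
```

Even on this toy corpus the bigram run doubles the number of distinct
tokens, and most of the additions are hapaxes.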

I've been experimenting with the bigram stuff and like it so far.  I also
have some mods to the DBDictClassifier stuff which add timestamps (last set,
last used) to the database.  There's some interaction between the bigram
code and the timestamp mods which keeps me from using them together.  It may
be worthwhile considering a last-used timestamp as a way to limit the number
of unused (or rarely used) tokens.
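
As a minimal sketch of that idea, assuming each token record carries a
last-used time: the plain dict, the record layout, and the function name
below are all hypothetical stand-ins, not the DBDictClassifier format or my
actual mods.

```python
import time

def prune_stale_tokens(db, max_idle_seconds, now=None):
    """Delete tokens not used within max_idle_seconds; return the count."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_seconds
    # Collect the stale keys first so the dict isn't modified while iterating.
    stale = [tok for tok, rec in db.items() if rec["last_used"] < cutoff]
    for tok in stale:
        del db[tok]
    return len(stale)

# Hypothetical record layout: counts plus the two timestamps.
db = {
    "viagra":       {"spam": 12, "ham": 0, "last_set": 100, "last_used": 950},
    "bi:buy cheap": {"spam": 1,  "ham": 0, "last_set": 100, "last_used": 100},
}
removed = prune_stale_tokens(db, max_idle_seconds=500, now=1000)
print(removed, sorted(db))
```

Whether dropping tokens this way hurts accuracy would need testing.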

The first thing I did was retrain and then score my then-current unsure
mailbox.  Out of about 40 messages, it scored over half as spam with
bigrams enabled.  I then took my entire training database (around 140 spams
and 100 hams) and tossed them into my unsure mailbox.  Using that now much
bigger mailbox (about 280 messages), I started a fresh round of
unsure+mistake-based training.  With bigrams I got to roughly the same
performance as without them, using a much smaller set of training messages.
I'm currently at 97 spams and 64 hams.  I'm still getting a fair number of
unsures, but the false positive rate doesn't seem horrible (I've seen a few,
but haven't been counting).

    Tim> + I believe that mistake-based training under this method is likely
    Tim>   to be substantially more brittle than mistake-based training
    Tim>   under the (still default) unigram-only scheme, because it's even
    Tim>   more hapax-driven (synthesizing bigrams creates many more
    Tim>   hapaxes).

As I was training, I noticed some wild fluctuations in scores with bigrams
enabled, especially with small databases.

Skip