[Spambayes] all but one testing

Tim Peters tim.one@comcast.net
Thu, 05 Sep 2002 18:20:33 -0400

[Neil Schemenauer]
> I've written a driver script that does "all but one testing".  The basic
> algorithm is:
>     gb = GrahamBayes()
>     for msg in spam:
>         gb.learn(msg, is_spam=True)
>     for msg in ham:
>         gb.learn(msg, is_spam=False)
>     for msg in spam:
>         gb.unlearn(msg, is_spam=True)
>         gb.spamprob(msg)
>         gb.learn(msg, is_spam=True)
>     for msg in ham:
>         gb.unlearn(msg, is_spam=False)
>         gb.spamprob(msg)
>         gb.learn(msg, is_spam=False)
>     print summary
> Is this type of testing useful?

It's sure better than nothing <wink>.  Also better than nothing, but not as
good, is doing the same thing but skipping the learn/unlearn calls after
initial training.

> As I understand it, it's most useful when you have a small amount of testing
> and training data.

I've run no experiments on training set size yet, and won't hazard a guess
as to how much is enough.  I'm nearly certain that the 4000h+2750s I've been
using is way more than enough, though.  It's a question of practical
importance open for fresh triumphs <wink>.

> That doesn't seem to be a problem for us.  Also, it's really slow.

Each call to learn() and to unlearn() computes a new probability for every
word in the database.  There's an official way to avoid that in the first
two loops, e.g.

    for msg in spam:
        gb.learn(msg, True, False)

In each of the last two loops, the total # of ham and total # of spam in the
"learned" set is invariant across loop trips, and you *could* break into the
abstraction to exploit that:  the only probabilities that actually change
across those loop trips are those associated with the words in msg.  Then
the runtime for each trip would be proportional to the # of words in the msg
rather than the number of words in the database.
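
A rough sketch of that shortcut, to make the idea concrete.  ToyBayes, its
probability formula, and its mean-based score() are all made up for
illustration; they are not the real GrahamBayes internals.  The point is only
that the held-out totals are fixed once per loop, so each trip touches just
the words in msg:

```python
from collections import Counter


class ToyBayes:
    """Hypothetical stand-in for the classifier, not the real GrahamBayes."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # is_spam -> word -> msgs
        self.total = {True: 0, False: 0}                   # is_spam -> msgs learned
        self.prob = {}                                     # cached word -> P(spam)

    def word_prob(self, word):
        # Placeholder formula, assumed only for this illustration.
        s = self.counts[True][word] / max(self.total[True], 1)
        h = self.counts[False][word] / max(self.total[False], 1)
        return s / (s + h) if s + h else 0.5

    def update_all(self):
        # Full pass: cost grows with the number of words in the database.
        vocab = set(self.counts[True]) | set(self.counts[False])
        self.prob = {w: self.word_prob(w) for w in vocab}

    def learn(self, words, is_spam, update=True):
        for w in set(words):
            self.counts[is_spam][w] += 1
        self.total[is_spam] += 1
        if update:
            self.update_all()

    def unlearn(self, words, is_spam, update=True):
        for w in set(words):
            self.counts[is_spam][w] -= 1
        self.total[is_spam] -= 1
        if update:
            self.update_all()

    def score(self, words):
        # Placeholder combiner: mean of the cached per-word probabilities.
        ws = sorted(set(words))
        return sum(self.prob[w] for w in ws) / len(ws) if ws else 0.5


def all_but_one_slow(clf, msgs, is_spam):
    """The straightforward loop: two full recomputations per message."""
    scores = []
    for words in msgs:
        clf.unlearn(words, is_spam)
        scores.append(clf.score(words))
        clf.learn(words, is_spam)
    return scores


def all_but_one_fast(clf, msgs, is_spam):
    """Same scores, but each trip only recomputes the words in the message."""
    clf.total[is_spam] -= 1          # held-out totals, identical on every trip
    clf.update_all()                 # one full pass up front
    scores = []
    for words in msgs:
        ws = set(words)
        for w in ws:                 # take this msg's counts out
            clf.counts[is_spam][w] -= 1
            clf.prob[w] = clf.word_prob(w)
        scores.append(clf.score(words))
        for w in ws:                 # put them back
            clf.counts[is_spam][w] += 1
            clf.prob[w] = clf.word_prob(w)
    clf.total[is_spam] += 1
    clf.update_all()                 # restore the fully trained state
    return scores
```

Both loops should produce the same scores; the fast one reaches into the
object's counts and cache directly, which is exactly the abstraction-breaking
part.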

Another area for potentially fruitful study:  it's clear that the
highest-value indicators usually appear "early" in msgs, and for spam
there's an actual reason for that:  advertising has to strive to get your
attention early.  So, for example, if we only bothered to tokenize the first
90% of a msg, would results get worse?  I doubt it.  And if not, what about
the first 50%?  The first 10%?  The first 1000 bytes?  max(1000 bytes, first
10%)?  That could also yield a major speed boost, and *may* even improve
results -- e.g., sometimes an on-topic message starts well but then rambles.
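
For anyone who wants to try the experiment, a minimal sketch of the cutoff
rule.  The function name and its defaults are invented here, and it simply
slices the raw text before whatever tokenizer you use sees it; whether any of
these cutoffs helps or hurts is the open question, not something this shows:

```python
def head_of_message(text, min_len=1000, percent=10):
    """Return the leading slice of text: max(min_len, percent% of its length).

    Works on str or bytes.  Hypothetical helper, not part of the classifier.
    """
    limit = max(min_len, len(text) * percent // 100)
    return text[:limit]


def tokenize_head(text, min_len=1000, percent=10):
    # Stand-in tokenizer (whitespace split), assumed only for illustration.
    # Note the slice may cut the final token in half.
    return head_of_message(text, min_len, percent).split()
```

Setting percent=50 or percent=90 with min_len=0 gives the other variants
mentioned above.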