[spambayes-dev] Re: Idea to re-energize corpus learning

Mon Nov 17 16:13:57 EST 2003

[Martin Stone Davis]
> ...
> So why not soften the blow?  That's what my proposal amounts to:
> achieving some sort of middle ground between the status quo and
> starting over.  After performing a "Soften training SEVERELY" (where
> the counts are all set to their square roots), messages would still
> be classified in more-or-less the same way.

You can't know that without running serious tests, and it sounds like
something tests would prove wrong.  SpamBayes effectively computes spamprobs
from ratios, and sqrt(x)/sqrt(y) = sqrt(x/y):  the effective relative ratios
would also get "square rooted", and that's likely to cause massive changes
in scoring.
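
A quick sketch of that arithmetic, under loud assumptions: simple_spamprob
below is a hypothetical, stripped-down stand-in (equal numbers of ham and
spam trained, no further adjustment), not the formula SpamBayes actually
uses, and the counts are made up for illustration.

```python
import math

def simple_spamprob(spamcount, hamcount):
    # Hypothetical simplification: assumes equal numbers of ham and spam
    # were trained on and skips any further adjustment.
    return spamcount / (spamcount + hamcount)

spamcount, hamcount = 100, 1   # made-up counts for one token

ratio_before = spamcount / hamcount                        # 100.0
ratio_after = math.sqrt(spamcount) / math.sqrt(hamcount)   # 10.0

prob_before = simple_spamprob(spamcount, hamcount)
prob_after = simple_spamprob(math.sqrt(spamcount), math.sqrt(hamcount))

print(ratio_before, ratio_after)
print(prob_before, prob_after)
```

A 100:1 token becomes a 10:1 token, so even in this toy model the
per-token probability moves noticeably; whether that hurts real-world
scoring is exactly what testing would have to show.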

"The usual" way (in many fields) to diminish counts that have grown "too
large" is to add 1, then shift right by one bit.  The purpose of adding 1
first is to prevent an original count of 1 from becoming 0.  Other than
that, it's basically "cut all the counts in half".  Then (x/2)/(y/2) = x/y,
so that relative ratios aren't affected much.  They can still change some,
especially for small counts, because counts 2*i+1 and 2*i+2, for any
i >= 0, are both reduced to i+1.
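
The add-1-then-shift rule can be sketched like this.  The function name
halve and the sample counts are illustrative only; nothing here is taken
from the SpamBayes code.

```python
def halve(count):
    # Add 1 first so that a count of 1 stays 1 instead of dropping to 0.
    return (count + 1) >> 1

# Large counts: the ratio survives intact.
big = (halve(100), halve(50))      # (50, 25), ratio still 2.0

# Small counts: 1 and 2 both map to 1, so a 2:1 ratio becomes 1:1.
small = (halve(2), halve(1))       # (1, 1)

# Counts 2*i+1 and 2*i+2 collapse to the same value i+1.
collapsed = [(halve(2 * i + 1), halve(2 * i + 2)) for i in range(3)]
# [(1, 1), (2, 2), (3, 3)]
```

The distortion is confined to the low end: the larger the counts, the
closer the halved ratio stays to the original.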

> However, further training would then be far more effective, since the
> counts would be lower.
>
> Doesn't that sound like a good idea?

If test results say that it is, yes; otherwise no.  A problem with
artificially mangling token counts is that you'll probably lose the ability
to meaningfully untrain a message again.  The relationship between token
counts and the total number of ham and spam trained on is destroyed by
reducing only one of them.  But if you reduce the total counts too, then
you've got more messages you *could* untrain on than the (reduced) total
count believes is possible; untraining anyway will then lead to worsening
inaccuracy until the reduced total count "goes negative", at which point the
code will probably blow up, or start to deliver pure nonsense results.
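
A toy model of the untraining problem, assuming both the per-token count
and the message total are halved.  The variable names and the bare
counters are hypothetical; this is not the SpamBayes classifier, just the
bookkeeping it would have to survive.

```python
def halve(count):
    # Add 1, then shift right by one bit.
    return (count + 1) >> 1

nspam = 10              # spam messages trained on
token_spamcount = 10    # hypothetical token that appeared in every one

# Reduce both the token count and the message total.
nspam = halve(nspam)
token_spamcount = halve(token_spamcount)

# All 10 original messages still exist, so all 10 can be untrained.
history = []
for _ in range(10):
    nspam -= 1
    token_spamcount -= 1
    history.append((nspam, token_spamcount))

print(history)
```

After the fifth untrain the totals hit zero, and every untrain after that
drives them negative, which is where ratio-based scoring stops making
sense.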

> -Martin
>
> P.S. I'm also sure that POPfile learns just as quickly as SpamBayes,
> since they are based on the same principle.

Sorry, but unless you've tested this, you have no basis for such a claim.
May be true, may be false, but "same principle" doesn't determine it a
priori (overlooking that the ways in which SpamBayes and POPfile determine a
category actually have very little in common).