[Spambayes] How low can you go?

Tim Peters tim.one at comcast.net
Wed Dec 17 19:13:58 EST 2003


[Seth Goodman]
> Does CRM114 use the number of trained ham and trained spam *messages*
> as variables in its probability calculation?  If not, then you
> wouldn't expect that deleting infrequently used tokens would do much
> damage.  AFAIK, SpamBayes uses the trained message counts in the
> probability calculation

Yes.

> and those become inaccurate if you delete individual tokens.

No, it doesn't matter if that's *all* you do.  Say I've trained on 243 ham
and 257 spam, total, and throw out the hapax 'bi:choose the'.  That has no
effect on the fact that the features I didn't throw out still came from
training on 243 ham and 257 spam, total.
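
To see why, here's a schematic of the Robinson-style word probability
(a sketch, not the actual SpamBayes code): each token's strength depends
only on its own counts plus the global message totals, so deleting some
*other* token's record leaves it exact.

    def spamprob(hamcount, spamcount, nham, nspam):
        # hamcount/spamcount are this token's counts; nham/nspam are
        # the total numbers of ham and spam messages trained on.
        hamratio = hamcount / nham
        spamratio = spamcount / nspam
        return spamratio / (hamratio + spamratio)

The real code goes on to blend in a prior for rarely seen words, but that
doesn't change the point.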

The problem comes when untraining a message M.  That reduces the count of
total messages trained on, but if I previously threw away a hapax H that
came from M, and H later reappeared via training on some other message, it
would be a mistake to reduce H's category count while untraining M: the
surviving count belongs to the other message, not to M.
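
Here's a tiny self-contained demonstration of the hazard (hypothetical
names, spam counts only):

    from collections import defaultdict

    nspam = 0                     # total spam messages trained on
    spamcount = defaultdict(int)  # token -> number of spams it appeared in

    def train_spam(tokens):
        global nspam
        nspam += 1
        for tok in set(tokens):
            spamcount[tok] += 1

    def untrain_spam(tokens):
        global nspam
        nspam -= 1
        for tok in set(tokens):
            if tok in spamcount:
                spamcount[tok] -= 1  # wrong for an expired-then-reborn hapax

    train_spam(['bi:choose the', 'money'])    # message M; H is a hapax
    del spamcount['bi:choose the']            # expire the hapax H
    train_spam(['bi:choose the', 'cash'])     # a later message revives H
    untrain_spam(['bi:choose the', 'money'])  # oops: M steals the count

After this runs, spamcount['bi:choose the'] is 0 even though a message
still in the training data contains that token.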

There's another bullet we haven't bitten yet: saving a map from message id
to an explicit list of all tokens produced by that message (Skip wants the
inverse of that mapping for diagnostic purposes too).  Given that, training
and untraining of individual messages could proceed smoothly despite
intervening changes in tokenization details, and expiring entire messages
would be straightforward.  When expiring an individual feature, it would be
enough to remove that feature from each msg->[feature] list it's in;
untraining a msg later then wouldn't *try* to decrement the per-feature
count of a feature that had been expired individually, even if the feature
appeared in the msg when it was trained.
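
A rough sketch of the shape that could take (hypothetical names, one
category, plain dicts standing in for the database):

    class Trainer:
        def __init__(self):
            self.nspam = 0
            self.spamcount = {}  # feature -> count
            self.msg_feats = {}  # message id -> list of its features

        def train(self, msgid, tokens):
            feats = sorted(set(tokens))
            self.msg_feats[msgid] = feats  # record exactly what we counted
            self.nspam += 1
            for f in feats:
                self.spamcount[f] = self.spamcount.get(f, 0) + 1

        def untrain(self, msgid):
            # Decrement exactly the features recorded at training time,
            # immune to any tokenizer changes made since.
            self.nspam -= 1
            for f in self.msg_feats.pop(msgid):
                self.spamcount[f] -= 1
                if not self.spamcount[f]:
                    del self.spamcount[f]

        def expire_feature(self, feat):
            # Scrub the feature from every per-message list too, so a
            # later untrain() won't try to decrement a count that's gone.
            self.spamcount.pop(feat, None)
            for feats in self.msg_feats.values():
                if feat in feats:
                    feats.remove(feat)

Expiring an entire message is then just untrain(msgid).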

That's all easy enough to do, but the database grows ever bigger.  It would
probably need reworking to start using "feature ids" (little integers) too,
so that relatively big strings didn't have to get duplicated all over the
database.
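
Feature ids amount to string interning; something along these lines (again
hypothetical, not existing code):

    class FeatureIds:
        def __init__(self):
            self.id_of = {}   # feature string -> small int
            self.str_of = []  # small int -> feature string

        def intern(self, feat):
            # Store each big feature string once; per-message lists and
            # count records refer to it by the little integer.
            fid = self.id_of.get(feat)
            if fid is None:
                fid = len(self.str_of)
                self.id_of[feat] = fid
                self.str_of.append(feat)
            return fid

The msg->[feature] lists would then hold little integers instead of
duplicating strings like 'bi:choose the' all over the database.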



