[spambayes-dev] RE: [Spambayes] How low can you go?

Seth Goodman nobody at spamcop.net
Wed Dec 17 20:41:30 EST 2003


[Tim Peters]
> No, it doesn't matter if that's *all* you do.  Say I've trained on
> 243 ham and 257 spam, total, and throw out the hapax 'bi:choose the'.
> That has no effect on the fact that the features I didn't throw out
> still came from training on 243 ham and 257 spam, total.

OK, but there are still a few potential problems.

1) Let's say the discarded bi-gram occurs in a spam at a later date.
Though it was only a hapax, it would still have contributed a little
evidence; now it contributes nothing.

2) Let's say we train on another spam containing the discarded bi-gram.
It was originally a hapax, so its occurrence count should now be two.
After training, though, it shows up as a hapax all over again (the
sketch after this list makes the problem concrete).  This is a more
significant problem.

3) Do we eventually reduce the occurrence count of a non-hapax token?
If we do, we could eventually have none of the tokens from a trained
message present while its message count is still there.  Unless we
implement your token cross-reference as explained below, the message
counts will eventually be wrong if we expire enough tokens.  And if we
don't expire a lot of tokens over the long run, why bother?
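
To make problem 2 concrete, here's a minimal sketch (toy Python, not
the real SpamBayes store; the names and layout are invented) of
expiring a hapax and then retraining on a message that contains it:

    # Toy token database: token -> spam occurrence count.
    spam_counts = {'bi:choose the': 1, 'viagra': 7}

    # Expire the hapax: its count of 1 is discarded outright.
    for tok, n in list(spam_counts.items()):
        if n == 1:
            del spam_counts[tok]

    # Later, train on a new spam containing the same bi-gram.
    for tok in ['bi:choose the', 'viagra']:
        spam_counts[tok] = spam_counts.get(tok, 0) + 1

    # The bi-gram has really been seen twice, but the database says
    # once, so it's immediately a hapax again and a candidate for
    # re-expiry.
    print(spam_counts['bi:choose the'])   # -> 1, not 2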

>
> The problem comes when untraining a message M.  That reduces the
> count of total messages trained on, but if I threw away a hapax H
> from M previously, and H reappeared again later, it would be a
> mistake to reduce the category count on H during untraining M.

Yup, and you have the solution below.

>
> There's another bullet we haven't bitten yet, saving a map of message
> id to an explicit list of all tokens produced by that message (Skip
> wants the inverse of that mapping for diagnostic purposes too).
> Given that, training and untraining of individual messages could
> proceed smoothly despite intervening changes in tokenization details;
> expiring entire messages would be straightforward; and when expiring
> an individual feature, it would be enough to remove that feature from
> each msg->[feature] list it's in (then untraining on a msg later
> wouldn't *try* to decrement the per-feature count of any feature that
> had previously been expired individually and appeared in the msg at
> the time).

This definitely works.  But why bother tracking, cross-referencing and
expiring individual tokens when we can just expire whole messages,
which is a lot simpler?  That keeps the token database clear of
excessive hapaxes and gradually expires non-hapax tokens as well.
There is also less need for a reverse index from tokens to messages,
since all messages and their tokens will eventually expire.  However,
if people need that feature, they need it.
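
Here's a rough Python sketch of the whole-message approach (the store
layout and names are my own invention, not the actual SpamBayes
classifier API):

    spam_counts = {}   # token -> occurrence count
    msg_tokens = {}    # msg_id -> tokens produced at training time
    nspam = 0          # total spam messages trained on

    def train_spam(msg_id, tokens):
        global nspam
        msg_tokens[msg_id] = list(tokens)  # remember what was counted
        for tok in tokens:
            spam_counts[tok] = spam_counts.get(tok, 0) + 1
        nspam += 1

    def expire_message(msg_id):
        # Untrain an entire message: every count it contributed comes
        # back out, so token counts and the message count stay
        # consistent by construction.
        global nspam
        for tok in msg_tokens.pop(msg_id):
            spam_counts[tok] -= 1
            if spam_counts[tok] == 0:
                del spam_counts[tok]
        nspam -= 1

Because the stored token list records exactly what was counted at
training time, later tokenizer changes can't desynchronize untraining,
which is the same property your msg->[feature] map provides.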

>
> That's all easy enough to do, but the database grows ever bigger.
>  It would
> probably need reworking to start using "feature ids" (little
> integers) too,
> so that relatively big strings didn't have to get duplicated all over the
> database.

No argument there.  How about a 32-bit hash for any token, whether
unigram, bi-gram, etc.?  The token database could then consist of an
ordered list of 32-bit hashes, each paired with an occurrence count (16
bits would probably do it).  That's only six bytes per token, and you
could use your indexing method of choice, if any, to speed up lookups.
Similarly, if we implemented a message database with this method, each
token in a message would take up only four bytes.  The hash calculation
costs something, but the smaller database size and quicker lookups
could make up for it.
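
As a back-of-the-envelope sketch of that six-bytes-per-token layout
(CRC-32 is just a stand-in for whatever 32-bit hash gets picked, and
this ignores the small chance of 32-bit collisions):

    import struct
    import zlib

    def token_hash(token):
        # Any decent 32-bit hash would do; CRC-32 is convenient.
        return zlib.crc32(token.encode('utf-8')) & 0xFFFFFFFF

    # Ordered table of fixed-width records: '>IH' packs a big-endian
    # 32-bit hash plus a 16-bit count -- exactly six bytes per token.
    counts = {'bi:choose the': 1, 'viagra': 7}
    table = sorted((token_hash(t), min(c, 0xFFFF))
                   for t, c in counts.items())
    db = b''.join(struct.pack('>IH', h, c) for h, c in table)

    def lookup(db, token):
        # Binary search over the sorted fixed-width records.
        h = token_hash(token)
        lo, hi = 0, len(db) // 6
        while lo < hi:
            mid = (lo + hi) // 2
            rec_hash, rec_count = struct.unpack_from('>IH', db, mid * 6)
            if rec_hash < h:
                lo = mid + 1
            elif rec_hash > h:
                hi = mid
            else:
                return rec_count
        return 0

    print(lookup(db, 'viagra'))   # -> 7

Keeping the records sorted by hash makes the lookup O(log n) with no
separate index at all.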

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above



