[spambayes-dev] RE: [Spambayes] How low can you go?

Tim Peters tim.one at comcast.net
Thu Dec 18 00:08:39 EST 2003


[Seth Goodman]
> OK, but there are still a couple of potential problems.

Oh, sure -- but testing is the only judge of what works here.

> 1) Let's say the discarded bi-gram occurs in a spam at a later date.
> Though it was only a hapax, it now contributes nothing.

I doubt it matters.  Most text classification systems (this field is more
than 40 years old, BTW) ignore hapaxes entirely, and also ignore tokens that
don't appear in at least *several* distinct training examples (see Paul
Graham's essay, where he carried on that tradition).  We don't ignore
anything, because testing said it worked better not to ignore anything in
this particular task.  It wasn't a killer-strong improvement to pay
attention to everything, but was a statistically significant win.  Good
enough.

Real-life use since then, unlike our randomized cross-validation testing,
doesn't see messages "at random" at all:  it sees them ordered in time.
That appears to make a difference, and actually helps us overall.

After some 16 months of watching this algorithm in various tests and in
practice, I've identified only two clear, repeated effects of hapaxes:

1. Good:  When a spam campaign begins, the hapaxes in its first example
   very often help to nail the upcoming variations in that campaign.
   People with small databases using mistake-based training see this
   dramatically, and it's very handy for them in real-life use.  A
   similar effect helps on the ham side, when training (e.g.) on that
   once-per-month HTML newsletter from (say) American Century Investments,
   which looks very spammy the first time around.  Because legit companies
   pay ad firms small fortunes to establish "brand identity", such
   newsletters are typically *stuffed* with hapaxes identifying the
   source.

2. Bad:  Most spam campaigns fizzle out within a month.  The hapaxes
   stick around, though.  Sooner or later an unusual ham comes across
   that just happens to hit a large number of the leftover spam hapaxes,
   then serves as a "spectacular failure" example here.  They're very
   rare, but very unsettling when they occur (well, likely *because*
   they're so rare for most people).

> 2) Let's say we want to train on a spam with the discarded bi-gram.
> It was originally a hapax, so it should now have an occurrence count
> of two.  After training, it again shows up as a hapax.  This is a
> more significant problem.

Based on what evidence?  Token spamprobs are guesses at best, and an
estimated spamprob based on only one or two examples isn't even reliable to
one significant digit.  The difference between seeing something once or
twice doesn't move a spamprob much, either.  So I have to guess that this
effect is so tiny it will be lost in estimation noise.
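
To make that concrete, here's a rough sketch of a Robinson-style per-token
spamprob estimate in the spirit of what the classifier does (the constants,
names and layout here are illustrative, not the project's actual code);
seeing a token once versus twice barely moves the result:

    S = 0.45   # weight given to the background prior (illustrative value)
    X = 0.5    # background prior for a never-seen word

    def spamprob(spamcount, hamcount, nspam, nham):
        """Robinson-style estimate of P(spam | word) from raw counts.
        Assumes the word was seen in at least one trained message."""
        spamratio = spamcount / float(nspam)
        hamratio = hamcount / float(nham)
        p = spamratio / (spamratio + hamratio)
        n = spamcount + hamcount
        # Shrink the raw ratio toward the prior X; with only a couple of
        # examples the prior still has a big say.
        return (S * X + n * p) / (S + n)

    print(spamprob(1, 0, 100, 100))   # ~0.84
    print(spamprob(2, 0, 100, 100))   # ~0.91 -- both just "pretty spammy"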

In early experiments, the database stored more info, and the test framework
was able to report which features were used *most* often in making a correct
decision.  Several times I took the few hundred "most valuable" features
(based on a combination of how often they contributed to a correct decision,
and their spamprob strength (distance, in either direction, from 0.5)), and
threw them out of the database.  An amazing (at the time) thing was that
this didn't hurt performance -- if the classifier was blinded to what *were*
its best clues, it found another set of clues that did just as well overall.
Performance eventually deteriorated dramatically if this was done over and
over again, but the system has already been shown to be very robust against
losing even its best features.  That's one reason I'm not worried about
throwing away its least useful features (hapaxes have weak spamprobs, and
hapaxes that haven't been *used* in scoring for N days may as well not have
existed at all for the last N days -- and most hapaxes are like that, no
matter how big N is).
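
A minimal sketch of the kind of expiration being talked about -- drop
features trained exactly once that haven't been used in scoring for N days.
The feature -> (spamcount, hamcount, last_used) record layout here is
hypothetical, not the real database schema:

    import time

    SECONDS_PER_DAY = 86400

    def expire_stale_hapaxes(db, n_days, now=None):
        """Drop features trained exactly once and not used in scoring
        for n_days."""
        if now is None:
            now = time.time()
        cutoff = now - n_days * SECONDS_PER_DAY
        for feature, (spamcount, hamcount, last_used) in list(db.items()):
            if spamcount + hamcount == 1 and last_used < cutoff:
                del db[feature]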

> 3) Do we eventually reduce the occurrence count of a non-hapax token?

There are many possible schemes.  Strongly storage-conscious schemes save
only a byte or two per count, and periodically shift all the counts right by
1 bit to prevent overflow.  That seems to work very well in systems that
do it.  I've already said here that I see the primary point of expiring
hapaxes as being a means to reduce database size, and in the context of the
much more storage-intensive mixed unigram/bigram scheme.  Hapaxes can
account for the bulk of the storage all by themselves (this isn't unique to
spam filtering, btw -- across many kinds of computer text indexing systems,
hapaxes typically account for about half the content), and most hapaxes are
never seen again.
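
Sketched in code, assuming a simple feature -> (spamcount, hamcount) mapping
(again hypothetical, not the project's schema), the periodic right-shift
looks like this; note that hapaxes are exactly the entries that decay to
zero on the first pass and drop out:

    def halve_counts(db):
        """Shift every count right one bit so small fixed-width counters
        can't overflow; entries that decay to (0, 0) drop out entirely."""
        for feature, (spamcount, hamcount) in list(db.items()):
            spamcount >>= 1
            hamcount >>= 1
            if spamcount == 0 and hamcount == 0:
                del db[feature]          # a former hapax expires here
            else:
                db[feature] = (spamcount, hamcount)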

I'm experimenting with a mixed unigram/bigram classifier right now.  It's
been trained on (just) 94 ham and 96 spam so far, but there are already
51,378 features in the database.  45,624 of them are hapaxes -- that's 89%!
I could eliminate everything except the hapaxes and not cut the database's
size enough to care about.  This is why picking specifically on hapaxes is a
high-value proposition (high potential, low risk).
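
If you want to reproduce that kind of measurement on a corpus of your own,
here's a rough sketch.  A real tokenizer does far more than split on
whitespace; this just pairs adjacent words to stand in for the mixed
unigram/bigram scheme:

    from collections import Counter

    def features(words):
        """Unigrams plus adjacent-word bigrams -- a crude stand-in for a
        mixed unigram/bigram tokenizer."""
        for w in words:
            yield w
        for a, b in zip(words, words[1:]):
            yield a + " " + b

    def hapax_stats(messages):
        """Return (total feature count, hapax count) over a list of
        message texts, counting each feature once per message."""
        counts = Counter()
        for text in messages:
            counts.update(set(features(text.split())))
        hapaxes = sum(1 for c in counts.values() if c == 1)
        return len(counts), hapaxes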

> If we do, we could eventually have none of the tokens from a trained
> message present but its message count will still be there.  Unless we
> implement your token cross-reference as explained below, the message
> counts will eventually not be correct if we expire enough tokens.

I want to do expiration "correctly".  But even if all the tokens from a
message expire when the total message count is N, it still doesn't change
that counts on tokens that remain were in fact derived from N messages, and
so N remains the best possible thing to feed into the spamprob guesses.

> If we don't expire a lot of tokens over the long run, why bother?

I expect an enormous number of hapaxes to expire, in steady state
essentially equaling the rate at which they're created by new messages.  In
the example above, roughly 90% of the features created for me right now *are*
hapaxes.  I expect that to drop with more training, but for hapaxes to
remain both the single biggest database consumer, and the least valuable
tokens to retain.

>> ...
>> There's another bullet we haven't bitten yet, saving a map of
>> message id to an explicit list of all tokens produced by that
>> message (Skip wants the inverse of that mapping for diagnostic
>> purposes too).  Given that, training and untraining of individual
>> messages could proceed smoothly despite intervening changes in
>> tokenization details; expiring entire messages would be
>> straightforward; and when expiring an individual feature, it would
>> be enough to remove that feature from each msg->[feature] list it's
>> in (then untraining on a msg later wouldn't *try* to decrement the
>> per-feature count of any feature that had previously been expired
>> individually and appeared in the msg at the time).

> This definitely works.  But why bother tracking, cross-referencing and
> expiring individual tokens when we can just expire whole messages,
> which is a lot simpler?

I doubt that it's simpler at all, and you earlier today sketched quite an
elaborate scheme for expiring different messages at different rates.  That's
got its share of tuning parameters (aka wild-ass guesses <wink>) too, showed
every sign of being just the beginning of its brand of complication, and has
no testing or experience to support it.  We know a lot about the real-life
effects of hapaxes now.

BTW, the single worst thing you can do with a system of this type is train a
message into the wrong category.  Everyone does it eventually, and some
people can't seem to help doing it often.  Maybe that's a UI problem at
heart -- I don't know, because I seem to be unusually resistant to it.  It's
happened to me too, though, and it can be hard to recover.  One sterling use
for a feature -> msg_ids map is, as Skip noted, a way to find out *why* your
latest spam was a false negative:  look at the low-scoring features, then
look at the messages with those features that were trained on as ham.  This
has an excellent shot at pinpointing mis-trained messages.  That's difficult
at best now, and is a real problem for some people.  I've got gigabytes of
unused disk space myself <wink>.
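
A sketch of how that lookup could work, assuming a hypothetical
feature -> msg_ids map and a set of msg_ids trained as ham (nothing like
this exists in the codebase yet):

    from collections import Counter

    def suspect_mistrained_ham(hammy_features, feature_to_msgids, ham_ids):
        """Given the low-scoring (hammy) features from a false negative,
        rank the trained-as-ham messages that supplied them."""
        suspects = Counter()
        for feature in hammy_features:
            for msg_id in feature_to_msgids.get(feature, ()):
                if msg_id in ham_ids:
                    suspects[msg_id] += 1
        # Messages contributing many hammy clues to a spam are the
        # likeliest to have been trained into the wrong category.
        return suspects.most_common()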

Evolution of this system would also be served by saving an explicit msg_id ->
features map.  When we change tokenization to get a small win, sometimes the
tokens originally added to a database by training on message M can no longer
be reconstructed by re-tokenizing M (the tokenizer has changed!  if it
always returned exactly what it returned before the change, there wasn't
much point to the change <wink>).  Blindly untraining anyway can violate
database invariants then, eventually manifesting as assertion errors and the
need to retrain from scratch.  The only clear and simple way to prevent this
is to save a map from msg_id to the tokens it originally produced.  Then
untraining simply walks that list, and nothing can go wrong as a result.
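
In code, with a hypothetical feature -> (spamcount, hamcount) database and
the msg_id -> features map under discussion (both layouts are sketches, not
existing structures), untraining becomes a simple walk:

    def untrain(db, msgid_to_features, msg_id, as_spam):
        """Untrain a message by walking the feature list recorded when it
        was trained, instead of re-tokenizing it (the tokenizer may have
        changed since then)."""
        for feature in msgid_to_features.pop(msg_id, ()):
            if feature not in db:
                continue              # expired individually in the meantime
            spamcount, hamcount = db[feature]
            if as_spam:
                spamcount = max(spamcount - 1, 0)
            else:
                hamcount = max(hamcount - 1, 0)
            if spamcount == 0 and hamcount == 0:
                del db[feature]
            else:
                db[feature] = (spamcount, hamcount)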

That's a bit subtle, so it takes some long-term experience to appreciate at a
gut level.  Of more immediate concern to most users is that only the
obsessed *want* to save their spam.  Most people want to throw spam away
ASAP.  But, if they do that, we currently have no way to expire any spam
they ever trained on.  Moving toward saving msg_ids <-> features maps solves
that too, and with suitable reuse of little integers for feature ids can
store the relevant bits about trained messages in less space than it takes
to save the original messages.  Note that hapaxes would waste the most
resources in this context too.

>> That's all easy enough to do, but the database grows ever bigger.
>> It would probably need reworking to start using "feature ids"
>> (little integers) too, so that relatively big strings didn't have to
>> get duplicated all over the database.

> No argument there.  How about a 32-bit hash for any token whether
> unigram, bi-gram, etc.?  The token database could then consist of an
> ordered list of 32-bit hashes paired with an occurrence count
> (16-bits would probably do it).  That's only six bytes/token, and you
> could use your indexing method of choice, if any, to speed up the
> lookups.

We ran experiments on that before, and results were dreadful.  32-bit hashes
have far too high a collision rate on a sizable database (don't forget the
Birthday Paradox here!), confusing ham with spam in highly entertaining ways
(provided you're just experimenting and don't really care how well it does).
An MD5 or SHA-1 hash would be fine, but then it's up to 16 or 20 bytes per
feature, and most of the strings we store in the current pure unigram scheme
are shorter than that.  A 64-bit hash would probably be OK.

Another consequence of using hash codes, widely hated among this project's
developers, is that mining the database for clues becomes useless.
"Hey, hash code 45485448 is your strongest spam clue!"  "Oh -- no wonder,
then" <wink>.  Storing the actual feature strings as plainly as possible is
extremely helpful for development, debugging, and research.

> Similarly, if we implemented a message database with this method, each
> token in a message would only take up four bytes.  The hash calculation
> costs something, but the smaller database size and quicker lookup time
> could make up for it.

We're not going to abandon plain strings, because they're far too useful and
loved in various reports intended for human consumption.  Adding feature_id
<-> feature_string maps would allow for effective compression of message
storage.
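
A minimal sketch of such a map (hypothetical, not existing code):

    class FeatureInterner:
        """Two-way map between feature strings and small integer ids, so
        per-message records can store compact id lists while reports
        still show readable strings."""
        def __init__(self):
            self.id_of = {}      # feature string -> small int
            self.string_of = []  # small int -> feature string

        def intern(self, feature):
            fid = self.id_of.get(feature)
            if fid is None:
                fid = len(self.string_of)
                self.id_of[feature] = fid
                self.string_of.append(feature)
            return fid

        def lookup(self, fid):
            return self.string_of[fid]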



