[spambayes-dev] Using crm114-style hash files

Tim Peters tim.one at comcast.net
Tue Jul 15 20:00:15 EDT 2003


[Dobes Vandermeer]
> Hearing all this talk of 27MB databases

Note that the 27MB database discussed recently used a thoroughly
inappropriate database backend, one that consumes 128 *times* more bytes
than really necessary.  The pickled version of my at-home classifier weighs
in at 1.5MB today (that's its size on disk), and saving some bytes in that
wouldn't do a thing for me.

> makes me want to suggest (sorry, I'm not a python hacker - yet - or I'd
> submit a file instead) trying crm114-style hash files.

Alex gave you a good summary of our previous experience.  I'll add that
we've gotten worse results out of every attempt to use hashing to bound the
database size, whether or not we also tried CRM114-style wildcard token
generation.  All uses of hashing eventually lead to very bad false
positives, where "very bad" means not just that it was ham that scored as
spam, but also that to the human eye the result was maddeningly
incomprehensible:  there was nothing in the FP that looked spammy at all.
Such occurrences were rare, but any scheme that allows mistaking one token
for another is doomed to bump into them on occasion.  Especially on short
msgs, it only takes a couple "unlucky" token pairs to destroy a correct
result -- process enough short messages, and that's inevitable.
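To make the failure mode concrete, here's a minimal sketch (not SpamBayes code; the table size and CRC folding are illustrative assumptions) of why any scheme that folds tokens into a fixed number of hash slots must eventually mistake one token for another:

```python
import zlib

NUM_SLOTS = 2 ** 16  # deliberately small table so collisions show up fast

def slot(token: str) -> int:
    # Fold a 32-bit CRC into the table; any such folding can collide.
    return zlib.crc32(token.encode()) % NUM_SLOTS

# Scan synthetic tokens until two distinct ones land in the same slot.
seen = {}
collision = None
for i in range(100_000):
    tok = f"token{i}"
    s = slot(tok)
    if s in seen:
        collision = (seen[s], tok)
        break
    seen[s] = tok

# By the birthday paradox, a collision turns up after only a few
# hundred tokens here -- two unrelated tokens now share one counter.
print(collision)
```

With 2**16 slots a collision appears after roughly sqrt(2**16) tokens; a bigger table only delays the inevitable, which is why the scheme is doomed on short messages where a couple of conflated tokens can flip the verdict.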

OTOH, CRM114-like wildcard tokenization learned quicker than what we've got,
meaning it got more good out of fewer training msgs at the start.  As the
amount of training data increased, its advantage diminished, and eventually
reversed.  Boosting the size of the hash table gave it an advantage again,
which again diminished and eventually reversed as the amount of training
data again increased.  This cycle continued across several hash-table
boosts, at which point my Python in-memory emulation of a disk-based hash
table ran out of VM.
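For readers who haven't seen it, CRM114-style wildcard tokenization pairs each word with subsets of the words in a small preceding window, so every message yields many compound features.  This is a rough sketch under that description (the real CRM114 scheme differs in its details; the function name and window size are my own):

```python
from itertools import combinations

def wildcard_features(words, window=5):
    # For each position, combine the current word with every ordered
    # subset of the preceding window-1 words.  A rough sketch of
    # CRM114-style sparse wildcard token generation.
    feats = []
    for i, w in enumerate(words):
        prev = words[max(0, i - window + 1):i]
        for r in range(len(prev) + 1):
            for combo in combinations(prev, r):
                feats.append(combo + (w,))
    return feats

# Even a 4-word message yields 15 features (1 + 2 + 4 + 8).
print(wildcard_features("buy cheap pills now".split()))
```

The combinatorial blowup is exactly why such a system learns quickly from few messages -- and also why its hash table fills up so fast as training data grows.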

I expect Bill would agree that "overtraining" a CRM114-style system is a
disaster:  each msg produces many tokens, and well before you train on tens
of thousands of msgs, a million hash slots each limited to a count of 255
plain saturates, and performance drops like a rock (counts become
increasingly meaningless as they all head toward 255, and hash collisions
become the rule rather than the exception; there are effective standard
techniques for rescaling the counts, but a high collision rate is
seriously bad news).
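The saturation problem is easy to see in miniature.  A sketch, assuming one byte per slot as in a CRM114-style hash file (the halving rescale is one of the standard techniques mentioned above, not necessarily what Bill's code does):

```python
MAX_COUNT = 255  # one byte per slot

def bump(counts, slot):
    # Saturating increment: once a slot hits 255 it stops carrying
    # information -- 300 trainings look the same as 255.
    counts[slot] = min(counts[slot] + 1, MAX_COUNT)

def rescale(counts):
    # A standard remedy: halve every count, preserving ratios at the
    # cost of resolution.  Note this does nothing for collisions --
    # unrelated tokens still share the same counter.
    for i, c in enumerate(counts):
        counts[i] = c // 2

counts = [0] * 8
for _ in range(300):
    bump(counts, 0)
print(counts[0])  # 255 -- saturated
rescale(counts)
print(counts[0])  # 127
```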
