[spambayes-dev] RE: [Spambayes] Watch out for digests...

Tim Peters tim.one at comcast.net
Thu Dec 11 11:17:47 EST 2003


[Tony]
> This is perhaps a drawback of the minimalist database size
> training strategy.

I think it's a consequence of mistake-based training (and minimal database
size is another consequence of *that*).

> I'm guessing that if you had a larger database, the effect wouldn't
> have been as pronounced?

A mistake in training has smaller effect under TOE (train-on-everything).
The other side of that is that a correctly-trained example also has smaller
effect under TOE.
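
To put rough numbers on that, here's a sketch using Robinson's adjustment
as SpamBayes applies it, f(w) = (s*x + n*p(w)) / (s + n).  The s=0.45 and
x=0.5 defaults are from memory, and the sketch ignores the ham/spam
corpus-size normalization, so take it as illustrative only:

    # How far one training example moves a token's adjusted spamprob.
    s, x = 0.45, 0.5   # assumed defaults: unknown-word strength and prior

    def f(n, p):
        """Robinson-adjusted spamprob: token seen n times, raw ratio p."""
        return (s * x + n * p) / (s + n)

    # Mistake-based counts: a hapax, then one contrary example.
    print(f(1, 1.0), '->', f(2, 0.5))        # ~0.84 -> 0.50: big swing
    # TOE-ish counts: 50 sightings, then one contrary example.
    print(f(50, 1.0), '->', f(51, 50/51))    # ~0.996 -> ~0.976: barely moves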

[Skip]
> Maybe.  At the moment, I have 9768 tokens in my database and 7731 of
> them are hapaxes.  As you suggest, it would appear mistakes can throw
> things off more dramatically,

We're rediscovering the bases for these old mantras:

    Mistake-based training leads to hapax-driven scoring.

    Hapax-driven scoring is brittle.

"brittle" is an antonymn of "robust" <wink>.  But in my personal email life,
I've been very happy with mistake-based training despite its drawbacks.
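
If you do want to eyeball your own numbers, something like this is enough.
It assumes the database is the usual mapping of token -> record with
.spamcount and .hamcount attributes (roughly what the pickle storage
holds, but check your version):

    # Fraction of tokens trained on exactly once (ham or spam).
    def hapax_fraction(wordinfo):
        hapaxes = sum(1 for w in wordinfo.values()
                      if w.spamcount + w.hamcount == 1)
        return hapaxes, len(wordinfo), float(hapaxes) / len(wordinfo)

For Skip's database that would print 7731, 9768, and about 0.79.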

> but it is also easier to detect.

Heh -- isn't that *because* it throws things off so dramatically <wink>?

> I'd be interested to see what others' hapax fractions are:

I don't think that's the right thing to measure.  There's really nothing in
a database that's interesting on its own; the only thing that matters to
performance is what gets used during *scoring* (everything else just sits
there, passively, the same as if it didn't exist, except for its effect on
database size).  A message score mostly derived from hapaxes is brittle
because a single contrary training example can change the classifier's view
of a hapax from "hammy" or "spammy" to "neither", and two contrary training
examples can swing it to the other classification.
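
Plugging into the same sketch as above (same assumed defaults, and
equal-sized ham and spam corpora):

    f(1, 1.0)    # ~0.84: a fresh spam hapax, solidly spammy
    f(2, 0.5)    #  0.50: one contrary ham example, "neither"
    f(3, 1/3)    # ~0.36: two contrary examples, now on the ham side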

In the early days, the database kept track of the last time a token was used
in scoring, and the test framework kept track of how often each token got
used in scoring.  There isn't an out-of-the-box way to get at that info
anymore, so it's much harder now to investigate how mistake-based training
leads to hapax-driven scoring.
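
A rough way to recreate that bookkeeping from outside is to wrap scoring
and tally the clues.  This assumes spamprob() still accepts an evidence
flag and returns (score, [(token, prob), ...]); check your version's
signature before trusting it:

    from collections import defaultdict

    clue_counts = defaultdict(int)   # token -> times used as a clue

    def score_and_tally(classifier, tokens):
        score, clues = classifier.spamprob(tokens, evidence=True)
        for token, prob in clues:
            clue_counts[token] += 1
        return score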

It's not *all* bad, or mistake-based training wouldn't be so effective for
so many of us.  Maybe the clearest example is that the hapaxes found in a
new spam campaign are precisely what let us get away with training one
sample and thereafter catch others from that campaign; in effect, hapaxes
act like a pretty large set of lexical fingerprints in that case.

> ...
> Another interesting thing (I think) might be to investigate the
> importance of synthetic tokens (e.g.: 'url:eweek' or
> 'received:168.10.156') vs. natural tokens (e.g., 'highlight' or
> 'dot') for smaller vs larger databases.  I think one of the reasons
> training a single unsure has a dramatic effect on a bunch of other
> unsure spams is because of all the synthetic tokens they have in
> common due to similar delivery mechanisms (gotta use that account
> before it gets shut down...).  If a spammer spews a bunch of messages
> from ISP A, then gets booted, his next spew will be from somewhere
> else.  I suspect many of the ISP-related synthetic tokens generated
> will only ever be hapaxes, and thus be much more important with a
> small database than with a large one.
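
That split would be easy to eyeball with a variant of the earlier sketch.
Treating any token containing a colon as synthetic is only a heuristic,
based on the 'url:' / 'received:' style prefixes in the examples above:

    # [hapaxes, total] per token kind.
    def hapaxes_by_kind(wordinfo):
        stats = {'synthetic': [0, 0], 'natural': [0, 0]}
        for token, w in wordinfo.items():
            kind = 'synthetic' if ':' in token else 'natural'
            stats[kind][1] += 1
            if w.spamcount + w.hamcount == 1:
                stats[kind][0] += 1
        return stats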

It was established before that hapaxes are vital in mistake-based training.
If you want to test that quickly but informally, modify a copy of your
database to throw away all the hapaxes, then live with that reduced database
for a while.  It will probably have a hard time even with the messages it
was originally trained with.
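
Concretely, against a *copy* of the database (same assumed attribute names
as before):

    # Doom every token trained on exactly once, then save the copy.
    doomed = [token for token, w in wordinfo.items()
              if w.spamcount + w.hamcount == 1]
    for token in doomed:
        del wordinfo[token]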



