[spambayes-dev] RE: [Spambayes] Watch out for digests...

Thu Dec 11 11:38:28 EST 2003

    >> I'd be interested to see what others' hapax fractions are:

    Tim> I don't think that's the right thing to measure.  There's really
    Tim> nothing in a database that's interesting on its own, the only thing
    Tim> that matters to performance is what gets used during *scoring*
    Tim> (everything else just sits there, passively, the same as if it
    Tim> didn't exist (except for its effect on database size)).  

Yes, you're correct, of course.  So what we might want to look at is the
relative occurrence of 0.84 and 0.16 scores (the scores a spam-only or
ham-only hapax ends up with) in message clues?
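
For reference, here is a rough sketch of where 0.84 and 0.16 would come
from.  It assumes a Robinson-style adjusted probability with an unknown-word
strength of 0.45 and an unknown-word probability of 0.5; those values, the
function name adjusted_prob and the corpus sizes are assumptions made for
illustration, not something read out of anyone's actual configuration.

```python
def adjusted_prob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Adjusted spam probability for one token (illustrative sketch).

    s (strength) and x (prior for an unseen word) are assumed values.
    The token must have been seen at least once.
    """
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    p = spamratio / (spamratio + hamratio)   # raw estimate, 0.0 .. 1.0
    n = spamcount + hamcount                 # total sightings of the token
    # Pull the raw estimate toward x; the pull weakens as n grows.
    return (s * x + n * p) / (s + n)


# Hypothetical corpus sizes; for a token seen on one side only they
# do not affect the result.
print(round(adjusted_prob(1, 0, nspam=500, nham=500), 2))  # spam-only hapax
print(round(adjusted_prob(0, 1, nspam=500, nham=500), 2))  # ham-only hapax
```

Under those assumptions a token trained once, in spam only, comes out at
about 0.84, and one trained once, in ham only, at about 0.16, so counting
clues with exactly those scores is a cheap proxy for counting hapaxes that
actually took part in scoring.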

    Tim> It's not *all* bad, or mistake-based training wouldn't be so
    Tim> effective for so many of us.  Maybe the clearest example is that
    Tim> the hapaxes found in a new spam campaign are precisely what let us
    Tim> get away with training one sample and thereafter catch others from
    Tim> that campaign; in effect, hapaxes act like a pretty large set of
    Tim> lexical fingerprints in that case.

This is where I think the synthetic vs. natural tokens thing would be
interesting.  I get lots of Viagra spam, most of which is caught, but in my
current database, 'viagra' is a hapax.  In fact, it appears I only added it
very recently.  Here's the evidence header from a message with the subject:

    Viagra, Soma, Fioricet, Prescribed Online for Free, Shipped Overnight

which was scored around 12:25 AM today:

    X-Spambayes-Evidence: '*H*': 0.03; '*S*': 0.90; 'drug': 0.16;
            'subject:Free': 0.16; 'store': 0.23; 'next': 0.25; 'list,': 0.30;
            'via': 0.34; 'subject:, ': 0.37; 'our': 0.62;
            'header:Reply-To:1': 0.64; 'enter': 0.67;
            'content-type:multipart/alternative': 0.68;
            'content-type:text/html': 0.74; 'doctors': 0.84;
            'prescription': 0.84; 'received:103]': 0.84;
            'received:165.175': 0.84; 'received:175': 0.84;
            'received:199.249.165.175': 0.84; 'received:249.165.175': 0.84;
            'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98

Most of the spammy clues are synthetic tokens related to delivery, not
content, and most of them are hapaxes.  My 'train on an unsure or false
negative, then check for spams' routine points the same way: training on a
single message often pushes several other spams about completely different
topics into the spam category.
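
The counting can be done mechanically.  The sketch below parses the evidence
header quoted above and tallies the clues.  Two things in it are assumptions:
a token is treated as synthetic if it contains a ':' (a crude heuristic), and
a score that prints as 0.84 or 0.16 is treated as a probable hapax, which
could occasionally be a coincidence.  The helper names are made up for this
example.

```python
import re

# Evidence header copied from the message above, joined into one string.
EVIDENCE = (
    "'*H*': 0.03; '*S*': 0.90; 'drug': 0.16; "
    "'subject:Free': 0.16; 'store': 0.23; 'next': 0.25; 'list,': 0.30; "
    "'via': 0.34; 'subject:, ': 0.37; 'our': 0.62; "
    "'header:Reply-To:1': 0.64; 'enter': 0.67; "
    "'content-type:multipart/alternative': 0.68; "
    "'content-type:text/html': 0.74; 'doctors': 0.84; "
    "'prescription': 0.84; 'received:103]': 0.84; "
    "'received:165.175': 0.84; 'received:175': 0.84; "
    "'received:199.249.165.175': 0.84; 'received:249.165.175': 0.84; "
    "'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98"
)

# Matches 'token': score.  Assumes no token contains a single quote.
CLUE_RE = re.compile(r"'([^']*)': ([01]\.\d+)")


def parse_clues(evidence):
    """Return (token, score) pairs, skipping the *H* and *S* summaries."""
    return [(tok, float(score))
            for tok, score in CLUE_RE.findall(evidence)
            if tok not in ("*H*", "*S*")]


def is_synthetic(token):
    # Heuristic (assumption): generated tokens carry a 'prefix:'.
    return ":" in token


def looks_like_hapax(score):
    # Assumption: 0.84 / 0.16 are the scores a one-sided hapax gets.
    return abs(score - 0.84) < 0.005 or abs(score - 0.16) < 0.005


clues = parse_clues(EVIDENCE)
spammy = [(t, s) for t, s in clues if s > 0.5]
spammy_hapax = [(t, s) for t, s in spammy if looks_like_hapax(s)]
synthetic_hapax = [t for t, s in spammy_hapax if is_synthetic(t)]

print(len(clues), len(spammy), len(spammy_hapax), len(synthetic_hapax))
```

For this one header, 14 of the 21 clues lean spammy, 7 of those 14 carry the
0.84 score, and 5 of those 7 are received: tokens.  Running the same tally
over a whole mailbox of scored messages would give the "what gets used during
scoring" number rather than the raw database fraction.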

This suggests a couple of other downsides to minimalist training.  One,
spammers have to move, so hapaxes related to delivery are likely to be
useful only for a short period while the spammer is abusing a single
account.  Two, if a delivery token pushes a bunch of other messages into the
spam category, and those messages are then never used as inputs to training,
the opportunity to reinforce that token's quality is lost, even though it
might actually appear fairly frequently in spam.
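
To put a rough number on the second point, this sketch shows how the score
of a spam-only token would move if the later messages were trained on.  It
assumes the same Robinson-style adjustment with strength 0.45 and prior 0.5;
the function name and those two values are assumptions for illustration.

```python
def spam_only_score(n, s=0.45, x=0.5):
    """Score of a token trained in n spams and no hams (sketch).

    s and x are assumed strength / prior values, not measured ones.
    """
    return (s * x + n * 1.0) / (s + n)


# n = 1 is the hapax case; larger n is what reinforcement would buy.
for n in (1, 2, 5, 10):
    print(n, round(spam_only_score(n), 3))
```

With train-on-mistakes only, the token stays at n = 1 (about 0.84) no matter
how many correctly classified spams it shows up in, whereas training on even
a handful of those messages would move it well above 0.9.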

Skip