[Spambayes] Watch out for digests...

Sun Dec 14 20:06:03 EST 2003

[Skip Montanaro]
> ...
> This is where I think the synthetic vs. natural tokens thing would be
> interesting.

I'm not sure what's being distinguished here.

> I get lots of Viagra spam, most of which is caught, but in my current
> database, 'viagra' is a hapax.  In fact, it appears I only added it
> very recently.  Here's the evidence header from a message with the
> subject:
>
>     Viagra, Soma, Fioricet, Prescribed Online for Free, Shipped
>     Overnight
>
> which was scored around 12:25 AM today:
>
>     X-Spambayes-Evidence: '*H*': 0.03; '*S*': 0.90; 'drug': 0.16;
>             'subject:Free': 0.16;

"Free" in a Subject line and "drug" in the body are hammy for you?  Staring
at clues from mistake-based training can be, umm, counter-intuitive <wink>.

>             'store': 0.23; 'next': 0.25; 'list,': 0.30;
>             'via': 0.34; 'subject:, ': 0.37; 'our': 0.62;
>             'header:Reply-To:1': 0.64; 'enter': 0.67;
>             'content-type:multipart/alternative': 0.68;
>             'content-type:text/html': 0.74; 'doctors': 0.84;
>             'prescription': 0.84; 'received:103]': 0.84;
>             'received:165.175': 0.84; 'received:175': 0.84;
>             'received:199.249.165.175': 0.84; 'received:249.165.175':
>             0.84; 'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98
>
> Most of the spammy clues are synthetic tokens related to delivery
> (and are mostly hapaxes), not content.

I'm not sure what's synthetic about these.  Most of your spam clues come
from the email *headers*, but that's fair game.  Note that mining received
headers is disabled by default, so you're getting a pile of clues most
people aren't getting.  Maybe they should.
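
To make the header-vs-body split concrete, here's a rough sketch that tallies
the spammy clues in the evidence header quoted above.  The prefix list and the
0.5 cutoff are my assumptions for illustration, not something Spambayes
defines, and the helper names are made up:

```python
import re

# Evidence string copied from the X-Spambayes-Evidence header quoted above.
EVIDENCE = (
    "'*H*': 0.03; '*S*': 0.90; 'drug': 0.16; 'subject:Free': 0.16; "
    "'store': 0.23; 'next': 0.25; 'list,': 0.30; 'via': 0.34; "
    "'subject:, ': 0.37; 'our': 0.62; 'header:Reply-To:1': 0.64; "
    "'enter': 0.67; 'content-type:multipart/alternative': 0.68; "
    "'content-type:text/html': 0.74; 'doctors': 0.84; "
    "'prescription': 0.84; 'received:103]': 0.84; "
    "'received:165.175': 0.84; 'received:175': 0.84; "
    "'received:199.249.165.175': 0.84; 'received:249.165.175': 0.84; "
    "'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98"
)

# Assumption: tokens with these prefixes were generated from headers.
HEADER_PREFIXES = ("subject:", "header:", "content-type:",
                   "received:", "reply-to:")


def parse_evidence(text):
    """Return (token, probability) pairs from an evidence string."""
    pairs = re.findall(r"'(.*?)':\s*(\d\.\d+)", text)
    return [(token, float(prob)) for token, prob in pairs]


def spammy_clues(text, cutoff=0.5):
    """Split clues scoring above `cutoff` into header and body lists."""
    header, body = [], []
    for token, prob in parse_evidence(text):
        # '*H*' and '*S*' are the combined ham/spam scores, not clues.
        if token in ("*H*", "*S*") or prob <= cutoff:
            continue
        if token.startswith(HEADER_PREFIXES):
            header.append(token)
        else:
            body.append(token)
    return header, body


header_clues, body_clues = spammy_clues(EVIDENCE)
print(len(header_clues), "header clues;", len(body_clues), "body clues")
```

By this count, 9 of the 14 spammy clues come from headers, and 5 of those 9
come from the Received line alone.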

> My 'train an unsure or false negative, check for spams' method suggests
> this is the case, since training on a single message often pushes several
> other spams about completely different topics into the spam category.

I'm unclear on what's noteworthy about that.  The biz domain is used by lots
of spam, lots of spam has a yahoo.com return address, lots of spam is
multipart/alternative HTML, and so on.  Looks like you're generating 4
correlated clues from a single Received header, and that you got one spam
before from the same box.  Strangely, though, it looks like you're
extracting *suffixes* of IP addresses instead of prefixes.  You've got

    199.249.165.175
        249.165.175
            165.175
                175

but not the almost-surely more useful

    199.249.165
    199.249
    199
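
Here's a minimal sketch of the difference.  It is not the tokenizer's actual
code; the function names are made up, and the second address is a hypothetical
neighbor in the same block as the one in your Received header:

```python
def ip_suffixes(addr):
    """Suffix tokens, longest first: what the evidence above shows."""
    parts = addr.split(".")
    return [".".join(parts[i:]) for i in range(len(parts))]


def ip_prefixes(addr):
    """Prefix tokens, longest first: what I'd expect to be more useful."""
    parts = addr.split(".")
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


seen = "199.249.165.175"      # address from the Received header above
neighbor = "199.249.165.12"   # hypothetical address from the same block

print(ip_suffixes(seen))
# ['199.249.165.175', '249.165.175', '165.175', '175']
print(ip_prefixes(seen))
# ['199.249.165.175', '199.249.165', '199.249', '199']

# Tokens the two addresses have in common under each scheme.
shared_prefixes = set(ip_prefixes(seen)) & set(ip_prefixes(neighbor))
shared_suffixes = set(ip_suffixes(seen)) & set(ip_suffixes(neighbor))
print(sorted(shared_prefixes))   # ['199', '199.249', '199.249.165']
print(sorted(shared_suffixes))   # []
```

With suffixes, a message from the neighboring address shares no tokens at all
with the one you trained on; with prefixes it shares three.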

> This suggests a couple other downsides to minimalist training.  One,
> spammers have to move, so hapaxes related to delivery are likely to
> only be useful for a short period while the spammer is abusing a
> single account.

IP *prefixes* should be useful despite that, due to the way IP space is
handed out.  If you're a spammer with a cooperative host, you're likely to
get other IP addresses from the netblocks assigned to that host, and they'll
share a common prefix.

> Two, if a delivery token pushes a bunch of other messages into the
> spam category which are then never used as inputs to training, the
> opportunity to reinforce that token's quality is lost, even though it
> might actually appear fairly frequently in spam.

I expect 'subject:Free' was a fine example of that.