[spambayes-dev] RE: [Spambayes] Watch out for digests...
Tim Peters
tim.one at comcast.net
Sun Dec 14 20:06:03 EST 2003
[Skip Montanaro]
> ...
> This is where I think the synthetic vs. natural tokens thing would be
> interesting.
I'm not sure what's being distinguished here.
> I get lots of Viagra spam, most of which is caught, but in my current
> database, 'viagra' is a hapax. In fact, it appears I only added it
> very recently. Here's the evidence header from a message with the
> subject:
>
> Viagra, Soma, Fioricet, Prescribed Online for Free, Shipped
> Overnight
>
> which was scored around 12:25 AM today:
>
> X-Spambayes-Evidence: '*H*': 0.03; '*S*': 0.90; 'drug': 0.16;
> 'subject:Free': 0.16;
"Free" in a Subject line and "drug" in the body are hammy for you? Staring
at clues from mistake-based training can be, umm, counter-intuitive <wink>.
> 'store': 0.23; 'next': 0.25; 'list,': 0.30;
> 'via': 0.34; 'subject:, ': 0.37; 'our': 0.62;
> 'header:Reply-To:1': 0.64; 'enter': 0.67;
> 'content-type:multipart/alternative': 0.68;
> 'content-type:text/html': 0.74; 'doctors': 0.84;
> 'prescription': 0.84; 'received:103]': 0.84;
> 'received:165.175': 0.84; 'received:175': 0.84;
> 'received:199.249.165.175': 0.84; 'received:249.165.175':
> 0.84; 'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98
>
> Most of the spammy clues are synthetic tokens related to delivery
> (and are mostly hapaxes), not content.
I'm not sure what's synthetic about these. Most of your spam clues come
from the email *headers*, but that's fair game. Note that mining received
headers is disabled by default, so you're getting a pile of clues most
people aren't getting. Maybe they should.
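As an aside on the numbers in that evidence header: the repeated 0.84 and 0.16 values are what a token seen exactly once (a hapax) would score under a Robinson-style adjusted probability. The sketch below is only an illustration, not the classifier's actual code. It assumes a prior of 0.5 and a prior strength of 0.45 for unknown words, and the function name, parameter names, and the 100/100 message counts are all made up for the example.

```python
def adjusted_spamprob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Robinson-style adjusted spam probability for one token.

    spam_count / ham_count: trained spams / hams containing the token.
    nspam / nham: total trained spams / hams (both must be > 0).
    s, x: assumed prior strength and prior probability for unknown words.
    """
    n = spam_count + ham_count
    if n == 0:
        return x  # never-seen token: fall back to the prior
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    p = spam_ratio / (spam_ratio + ham_ratio)  # raw estimate from the counts
    return (s * x + n * p) / (s + n)           # shrink the estimate toward x


# A token seen once, in spam only, and a token seen once, in ham only.
spam_hapax = adjusted_spamprob(1, 0, nspam=100, nham=100)
ham_hapax = adjusted_spamprob(0, 1, nspam=100, nham=100)
print(round(spam_hapax, 2), round(ham_hapax, 2))  # 0.84 0.16
```

Under those assumptions, every clue at 0.84 or 0.16 rests on a single training message, which is consistent with the remark above that most of these clues are hapaxes.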
> My 'train an unsure or false negative, check for spams' method suggests
> this is the case, since training on a single message often pushes several
> other spams about completely different topics into the spam category.
I'm unclear on what's noteworthy about that. The biz domain is used by lots
of spam, lots of spam has a yahoo.com return address, lots of spam is
multipart/alternative HTML, and so on. Looks like you're generating 4
correlated clues from a single Received header, and that you got one spam
before from the same box. Strangely, though, it looks like you're sucking
out *suffixes* of IP addrs instead of prefixes (you've got
    199.249.165.175
    249.165.175
    165.175
and
    175
but not the almost-surely more useful
    199.249.165
    199.249
and
    199
).
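To make the suffix/prefix difference concrete, here is a minimal sketch of both breakdowns for a dotted-quad address. The function names and the way the 'received:' tag is attached are assumptions for illustration only, not the tokenizer's real code.

```python
def ip_prefix_tokens(addr, tag="received:"):
    """Tokens for the leading 1, 2, 3 and 4 octets of a dotted-quad address."""
    parts = addr.split(".")
    return [tag + ".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def ip_suffix_tokens(addr, tag="received:"):
    """Tokens for the trailing 4, 3, 2 and 1 octets (what the header above shows)."""
    parts = addr.split(".")
    return [tag + ".".join(parts[i:]) for i in range(len(parts))]


print(ip_suffix_tokens("199.249.165.175"))
# ['received:199.249.165.175', 'received:249.165.175',
#  'received:165.175', 'received:175']
print(ip_prefix_tokens("199.249.165.175"))
# ['received:199', 'received:199.249',
#  'received:199.249.165', 'received:199.249.165.175']
```

Both breakdowns include the full address, so they differ only in the three partial tokens.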
> This suggests a couple other downsides to minimalist training. One,
> spammers have to move, so hapaxes related to delivery are likely to
> only be useful for a short period while the spammer is abusing a
> single account.
IP *prefixes* should be useful despite that, due to the way IP space is
handed out. If you're a spammer with a cooperative host, you're likely to
get other IP addresses from the netblocks assigned to that host, and they'll
share a common prefix.
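A small sketch of why that matters for training: if a spammer moves to another address in the same netblock, the new address still hits most of the prefix clues already trained on, but none of the suffix clues. The second address below (199.249.165.12) is hypothetical, chosen only to sit in the same block as the one in the evidence header, and the helper names are made up.

```python
def prefixes(addr):
    """Set of leading-octet strings: '199', '199.249', ..."""
    parts = addr.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts) + 1)}


def suffixes(addr):
    """Set of trailing-octet strings: '175', '165.175', ..."""
    parts = addr.split(".")
    return {".".join(parts[i:]) for i in range(len(parts))}


trained = "199.249.165.175"  # address from the evidence header
moved = "199.249.165.12"     # hypothetical neighbor in the same netblock

print(sorted(prefixes(trained) & prefixes(moved)))
# ['199', '199.249', '199.249.165']
print(sorted(suffixes(trained) & suffixes(moved)))
# []
```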
> Two, if a delivery token pushes a bunch of other messages into the
> spam category which are then never used as inputs to training, the
> opportunity to reinforce that token's quality is lost, even though it
> might actually appear fairly frequently in spam.
I expect 'subject:Free' was a fine example of that.