[Spambayes] Watch out for digests...
skip at pobox.com
Sun Dec 14 21:13:43 EST 2003
>> X-Spambayes-Evidence: '*H*': 0.03; '*S*': 0.90; 'drug': 0.16;
>> 'subject:Free': 0.16;
Tim> "Free" in a Subject line and "drug" in the body are hammy for you?
Tim> Staring at clues from mistake-based training can be, umm,
Tim> counter-intuitive <wink>.
Yeah, one of the online communities I participate in is a list of parents of
"troubled kids", hence the hammy "drug" reference. "subject:Free" comes
from the music community:
Subject: SFS Special Announcement (Free Guest List to Fluid this Friday)
>> 'store': 0.23; 'next': 0.25; 'list,': 0.30;
>> 'via': 0.34; 'subject:, ': 0.37; 'our': 0.62;
>> 'header:Reply-To:1': 0.64; 'enter': 0.67;
>> 'content-type:multipart/alternative': 0.68;
>> 'content-type:text/html': 0.74; 'doctors': 0.84;
>> 'prescription': 0.84; 'received:103]': 0.84;
>> 'received:165.175': 0.84; 'received:175': 0.84;
>> 'received:199.249.165.175': 0.84; 'received:249.165.175':
>> 0.84; 'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98
>> Most of the spammy clues are synthetic tokens related to delivery
>> (and are mostly hapaxes), not content.
Tim> I'm not sure what's synthetic about these.
I guess my operational definitions of "synthetic" and "natural" tokens are:
"natural tokens" are those which derive simply by splitting the message
body on whitespace boundaries.
"synthetic tokens" are those which are not "natural tokens".
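The distinction above can be sketched in a few lines of Python. This is only an illustration of the two definitions, not the actual Spambayes tokenizer: the function names and the `\w+` pattern used to split the Subject line are assumptions made for the example.

```python
import re

def natural_tokens(body):
    """'Natural' tokens: the message body split on whitespace, nothing more."""
    return body.split()

def synthetic_subject_tokens(subject):
    """Hypothetical 'synthetic' tokens: each Subject word tagged with 'subject:'.

    The r"\\w+" word pattern is an assumption for this sketch.
    """
    return ["subject:" + word for word in re.findall(r"\w+", subject)]

body_tokens = natural_tokens("Free drug list, via our store")
subject_tokens = synthetic_subject_tokens(
    "SFS Special Announcement (Free Guest List to Fluid this Friday)")
```

With these definitions a body word stays plain ("Free", "list,"), while the same word in the Subject header becomes the separate token "subject:Free".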
Tim> Most of your spam clues come from the email *headers*, but that's
Tim> fair game. Note that mining received headers is disabled by
Tim> default, so you're getting a pile of clues most people aren't
Tim> getting. Maybe they should.
Sure, email headers are fair game, but if the tokenizer didn't do anything
special with them, that "subject:Free" token would at most just be "free" or
>> My 'train an unsure or false negative, check for spams' method
>> suggests this is the case, since training on a single message often
>> pushes several other spams about completely different topics into the
>> spam category.
Tim> I'm unclear on what's noteworthy about that. The biz domain is
Tim> used by lots of spam, lots of spam has a yahoo.com return address,
Tim> lots of spam is multipart/alternative HTML, and so on. Looks like
Tim> you're generating 4 correlated clues from a single Received header,
Tim> and that you got one spam before from the same box. Strangely,
Tim> though, it looks like you're sucking out *suffixes* of IP addrs
Tim> instead of prefixes (you've got
Tim> but not the almost-surely more useful
I don't know. I agree those look backwards (that's my mail server, BTW).
OTOH, given the fairly random assignment of IP networks, I doubt it makes
much sense for the above IP address to be stripped of more than the last two
octets ("received:199.249.165.175", "received:199.249.165" and
"received:199.249"). "received:199", where 199 is the first octet, not the
last, almost certainly means nothing. If it's spammy or hammy, it's just by
chance.
>> This suggests a couple other downsides to minimalist training. One,
>> spammers have to move, so hapaxes related to delivery are likely to
>> only be useful for a short period while the spammer is abusing a
>> single account.
Tim> IP *prefixes* should be useful despite that, due to the way IP
Tim> space is handed out. If you're a spammer with a cooperative host,
Tim> you're likely to get other IP addresses from the netblocks assigned
Tim> to that host, and they'll share a common prefix.
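The difference between suffix and prefix tokens can be made concrete with a small Python sketch. It is not the Spambayes tokenizer: the function names and the `min_octets` parameter are hypothetical, and the address 199.249.165.175 is pieced together from the tokens quoted in this thread.

```python
def received_suffix_tokens(ip):
    """Suffix tokens, as seen in the evidence list (e.g. 'received:165.175')."""
    octets = ip.split(".")
    return ["received:" + ".".join(octets[n:]) for n in range(len(octets))]

def received_prefix_tokens(ip, min_octets=2):
    """Prefix tokens, from the full address down to min_octets leading octets.

    min_octets is a hypothetical knob; 2 keeps nothing more general than
    the first two octets.
    """
    octets = ip.split(".")
    return ["received:" + ".".join(octets[:n])
            for n in range(len(octets), min_octets - 1, -1)]

suffixes = received_suffix_tokens("199.249.165.175")
prefixes = received_prefix_tokens("199.249.165.175")
```

Two hosts in the same netblock would share the shorter prefix tokens, whereas suffix tokens such as "received:175" match any address that happens to end in the same octets.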
Again, no more general than the first two octets (a class B network). Class
A networks are very rare (for obvious reasons):
>> Two, if a delivery token pushes a bunch of other messages into the
>> spam category which are then never used as inputs to training, the
>> opportunity to reinforce that token's quality is lost, even though it
>> might actually appear fairly frequently in spam.
Tim> I expect 'subject:Free' was a fine example of that.
'subject:Free' is now slightly spammy, having turned up in three spams and
only one ham at this point.
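As a rough illustration of how counts like these turn into a score, here is a sketch of a Robinson-style adjusted token probability in Python. The totals of trained spam and ham (100 each) are invented for the example, and the strength `s` and prior `x` are assumed values, so the result will not match the real database.

```python
def spamprob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Token spam probability, pulled toward the prior x when evidence is thin.

    spam_count / ham_count: messages of each kind containing the token.
    nspam / nham: total trained messages of each kind (hypothetical here).
    s, x: assumed strength and prior for a rarely seen token.
    """
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    prob = spam_ratio / (spam_ratio + ham_ratio)
    n = spam_count + ham_count
    return (s * x + n * prob) / (s + n)

# Three spams and one ham, with made-up totals of 100 trained messages each.
score = spamprob(3, 1, nspam=100, nham=100)
```

Under those assumptions the score comes out a little above 0.7, and with so few sightings each additional trained message still moves it noticeably.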