[spambayes-dev] RE: [Spambayes] Watch out for digests...
Skip Montanaro
skip at pobox.com
Sun Dec 14 21:13:43 EST 2003
>> X-Spambayes-Evidence: '*H*': 0.03; '*S*': 0.90; 'drug': 0.16;
>> 'subject:Free': 0.16;
Tim> "Free" in a Subject line and "drug" in the body are hammy for you?
Tim> Staring at clues from mistake-based training can be, umm,
Tim> counter-intuitive <wink>.
Yeah, one of the online communities I participate in is a list of parents of
"troubled kids", hence the hammy "drug" reference. "subject:Free" comes
from the music community:
Subject: SFS Special Announcement (Free Guest List to Fluid this Friday)
>> 'store': 0.23; 'next': 0.25; 'list,': 0.30;
>> 'via': 0.34; 'subject:, ': 0.37; 'our': 0.62;
>> 'header:Reply-To:1': 0.64; 'enter': 0.67;
>> 'content-type:multipart/alternative': 0.68;
>> 'content-type:text/html': 0.74; 'doctors': 0.84;
>> 'prescription': 0.84; 'received:103]': 0.84;
>> 'received:165.175': 0.84; 'received:175': 0.84;
>> 'received:199.249.165.175': 0.84; 'received:249.165.175':
>> 0.84; 'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98
>>
>> Most of the spammy clues are synthetic tokens related to delivery
>> (and are mostly hapaxes), not content.
Tim> I'm not sure what's synthetic about these.
I guess my operational definitions of "synthetic" and "natural" tokens are
in order:
"natural tokens" are those which derive simply by splitting the message
body on whitespace boundaries.
"synthetic tokens" are those which are not "natural tokens".
Tim> Most of your spam clues come from the email *headers*, but that's
Tim> fair game. Note that mining received headers is disabled by
Tim> default, so you're getting a pile of clues most people aren't
Tim> getting. Maybe they should.
Sure, email headers are fair game, but if the tokenizer didn't do anything
special with them, that "subject:Free" token would at most just be "free" or
"Free".
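To make the natural/synthetic distinction concrete, here is a minimal sketch. It is not the actual SpamBayes tokenizer: the function names and the word regex are assumptions made for illustration, and the real tokenizer generates many more header tokens (punctuation runs, for example).

```python
import re

# Hypothetical word pattern, chosen only for this illustration.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def natural_tokens(body):
    """'Natural' tokens: the message body split on whitespace, nothing more."""
    return body.split()


def subject_tokens(subject):
    """'Synthetic' tokens: Subject words tagged with a 'subject:' prefix,
    so they are counted separately from the same words in the body."""
    return ["subject:" + word for word in WORD_RE.findall(subject)]


subject = "SFS Special Announcement (Free Guest List to Fluid this Friday)"
print(subject_tokens(subject))
print(natural_tokens("Free prescription from our doctors"))
```

With the prefix, "Free" in a Subject line and "Free" in a body are scored as two unrelated clues, which is why the same word can be hammy in one place and spammy in the other.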
>> My 'train an unsure or false negative, check for spams' method
>> suggests this is the case, since training on a single message often
>> pushes several other spams about completely different topics into the
>> spam category.
Tim> I'm unclear on what's noteworthy about that. The biz domain is
Tim> used by lots of spam, lots of spam has a yahoo.com return address,
Tim> lots of spam is multipart/alternative HTML, and so on. Looks like
Tim> you're generating 4 correlated clues from a single Received header,
Tim> and that you got one spam before from the same box. Strangely,
Tim> though, it looks like you're sucking out *suffixes* of IP addrs
Tim> instead of prefixes (you've got
Tim> 199.249.165.175
Tim> 249.165.175
Tim> 165.175
Tim> and
Tim> 175
Tim> but not the almost-surely more useful
Tim> 199.249.165
Tim> 199.249
Tim> and
Tim> 199
Tim> ).
I don't know. I agree those look backwards (that's my mail server, BTW).
OTOH, given the fairly random assignment of IP networks, I doubt it makes
much sense for the above IP address to be stripped of more than the last two
octets ("received:199.249.165.175", "received:199.249.165" and
"received:199.249"). "received:199", where 199 is the first octet, not the
last, almost certainly means nothing. If it's spammy or hammy, it's just by
sheer coincidence.
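A rough sketch of the two behaviours, assuming a dotted-quad IPv4 address has already been pulled out of the Received header. The function names and the max_stripped parameter are made up for this example and are not taken from the SpamBayes source.

```python
def received_suffix_tokens(ip):
    """Suffix-style tokens, matching the clues shown in the evidence header."""
    octets = ip.split(".")
    return ["received:" + ".".join(octets[start:])
            for start in range(len(octets))]


def received_prefix_tokens(ip, max_stripped=2):
    """Prefix-style tokens, dropping at most `max_stripped` trailing octets."""
    octets = ip.split(".")
    shortest = max(len(octets) - max_stripped, 1)
    return ["received:" + ".".join(octets[:keep])
            for keep in range(len(octets), shortest - 1, -1)]


print(received_suffix_tokens("199.249.165.175"))
print(received_prefix_tokens("199.249.165.175"))
```

With max_stripped=2 the prefix version stops at "received:199.249" and never emits the bare first octet.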
>> This suggests a couple other downsides to minimalist training. One,
>> spammers have to move, so hapaxes related to delivery are likely to
>> only be useful for a short period while the spammer is abusing a
>> single account.
Tim> IP *prefixes* should be useful despite that, due to the way IP
Tim> space is handed out. If you're a spammer with a cooperative host,
Tim> you're likely to get other IP addresses from the netblocks assigned
Tim> to that host, and they'll share a common prefix.
Again, no more general than the first two octets (a class B network). Class
A networks are very rare (for obvious reasons):
http://euclid.math.brandeis.edu/turtschi/whois/neta1.html
>> Two, if a delivery token pushes a bunch of other messages into the
>> spam category which are then never used as inputs to training, the
>> opportunity to reinforce that token's quality is lost, even though it
>> might actually appear fairly frequently in spam.
Tim> I expect 'subject:Free' was a fine example of that.
'subject:Free' is now slightly spammy, having turned up in three spams and
only one ham at this point.
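As a back-of-the-envelope check on what three spams and one ham does to a token, here is a sketch of a Robinson-style smoothed token probability. The function name, the s and x defaults, and the corpus sizes are assumptions for illustration. My real training totals aren't shown here, and the classifier's actual computation may differ in detail.

```python
def token_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Smoothed spam probability for one token.

    spamcount/hamcount: trained messages of each kind containing the token.
    nspam/nham: total trained messages of each kind (hypothetical here).
    s, x: strength and prior for rarely seen tokens (assumed values).
    """
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    prob = spamratio / (spamratio + hamratio)
    n = spamcount + hamcount
    # Pull the raw estimate toward the prior x; the pull fades as n grows.
    return (s * x + n * prob) / (s + n)


# Hypothetical corpus with equal numbers of trained ham and spam.
print(token_spamprob(spamcount=3, hamcount=1, nspam=100, nham=100))
```

With only four sightings the prior still drags the score noticeably toward 0.5, and an unbalanced ham/spam training set would shift it further in one direction or the other.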
Skip