[Spambayes] Watch out for digests...

Skip Montanaro skip at pobox.com
Sun Dec 14 21:13:43 EST 2003

    >> X-Spambayes-Evidence: '*H*': 0.03; '*S*': 0.90; 'drug': 0.16;
    >> 'subject:Free': 0.16;

    Tim> "Free" in a Subject line and "drug" in the body are hammy for you?
    Tim> Staring at clues from mistake-based training can be, umm,
    Tim> counter-intuitive <wink>.

Yeah, one of the online communities I participate in is a list of parents of
"troubled kids", hence the hammy "drug" reference.  "subject:Free" comes
from the music community:

    Subject: SFS Special Announcement (Free Guest List to Fluid this Friday)

    >> 'store': 0.23; 'next': 0.25; 'list,': 0.30;
    >> 'via': 0.34; 'subject:, ': 0.37; 'our': 0.62;
    >> 'header:Reply-To:1': 0.64; 'enter': 0.67;
    >> 'content-type:multipart/alternative': 0.68;
    >> 'content-type:text/html': 0.74; 'doctors': 0.84;
    >> 'prescription': 0.84; 'received:103]': 0.84;
    >> 'received:165.175': 0.84; 'received:175': 0.84;
    >> 'received:': 0.84; 'received:249.165.175':
    >> 0.84; 'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98
    >> Most of the spammy clues are synthetic tokens related to delivery
    >> (and are mostly hapaxes), not content.

    Tim> I'm not sure what's synthetic about these.  

I guess my operational definitions of "synthetic" and "natural" tokens are
in order:

    "natural tokens" are those which derive simply by splitting the message
    body on whitespace boundaries.

    "synthetic tokens" are those which are not "natural tokens".

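A minimal sketch of that distinction, for illustration only. It is not the Spambayes tokenizer; both function names are hypothetical, and the "subject:" tagging simply mimics the shape of the 'subject:Free' clue shown above.

```python
def natural_tokens(body):
    """Natural tokens: the message body split on whitespace, nothing more."""
    return body.split()


def synthetic_subject_tokens(subject):
    """Hypothetical synthetic tokens: each Subject word tagged with 'subject:'."""
    return ["subject:" + word for word in subject.split()]


body_tokens = natural_tokens("Free drug store  next door")
subject_tokens = synthetic_subject_tokens("Free Guest List")
```

Under this definition "Free" in the body and "subject:Free" are distinct tokens, so each collects its own ham and spam counts.
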
    Tim> Most of your spam clues come from the email *headers*, but that's
    Tim> fair game.  Note that mining received headers is disabled by
    Tim> default, so you're getting a pile of clues most people aren't
    Tim> getting.  Maybe they should.

Sure, email headers are fair game, but if the tokenizer didn't do anything
special with them, that "subject:Free" token would at most just be "free" or

    >> My 'train an unsure or false negative, check for spams' method
    >> suggests this is the case, since training on a single message often
    >> pushes several other spams about completely different topics into the
    >> spam category.

    Tim> I'm unclear on what's noteworthy about that.  The biz domain is
    Tim> used by lots of spam, lots of spam has a yahoo.com return address,
    Tim> lots of spam is multipart/alternative HTML, and so on.  Looks like
    Tim> you're generating 4 correlated clues from a single Received header,
    Tim> and that you got one spam before from the same box.  Strangely,
    Tim> though, it looks like you're sucking out *suffixes* of IP addrs
    Tim> instead of prefixes (you've got

    Tim>         249.165.175
    Tim>             165.175
    Tim> and
    Tim>                 175

    Tim> but not the almost-surely more useful

    Tim>     199.249.165
    Tim>     199.249
    Tim> and
    Tim>     199
    Tim> ).

I don't know.  I agree those look backwards (that's my mail server, BTW).
OTOH, given the fairly random assignment of IP networks, I doubt it makes
much sense for the above IP address to be stripped of more than the last two
octets ("received:", "received:199.249.165" and
"received:199.249").  "received:199", where 199 is the first octet, not the
last, almost certainly means nothing.  If it's spammy or hammy, it's just by
sheer coincidence.
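
The prefix idea can be sketched in a few lines.  This is only an illustration, not what the Spambayes tokenizer does: the function name received_prefix_tokens and the min_octets parameter are made up, and the full address 199.249.165.175 is an assumption pieced together from the fragments quoted in this thread.

```python
def received_prefix_tokens(ip, min_octets=2):
    """Return 'received:' tokens for the full address and its prefixes.

    Prefixes shorter than min_octets octets are not generated, on the
    theory that a lone first octet carries no real information.
    """
    octets = ip.split(".")
    valid = len(octets) == 4 and all(
        o.isascii() and o.isdigit() and int(o) <= 255 for o in octets
    )
    if not valid:
        return []  # not a dotted-quad IPv4 address; emit nothing
    # n counts down from 4 octets (full address) to min_octets.
    return ["received:" + ".".join(octets[:n])
            for n in range(4, min_octets - 1, -1)]


tokens = received_prefix_tokens("199.249.165.175")
```

With min_octets=2 the result stops at "received:199.249".  Passing min_octets=1 would also emit "received:199", the token argued above to be meaningless.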

    >> This suggests a couple other downsides to minimalist training.  One,
    >> spammers have to move, so hapaxes related to delivery are likely to
    >> only be useful for a short period while the spammer is abusing a
    >> single account.

    Tim> IP *prefixes* should be useful despite that, due to the way IP
    Tim> space is handed out.  If you're a spammer with a cooperative host,
    Tim> you're likely to get other IP addresses from the netblocks assigned
    Tim> to that host, and they'll share a common prefix.

Again, no more general than the first two octets (a class B network).  Class
A networks are very rare (for obvious reasons):


    >> Two, if a delivery token pushes a bunch of other messages into the
    >> spam category which are then never used as inputs to training, the
    >> opportunity to reinforce that token's quality is lost, even though it
    >> might actually appear fairly frequently in spam.

    Tim> I expect 'subject:Free' was a fine example of that.

'subject:Free' is now slightly spammy, having turned up in three spams and
only one ham at this point.
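
For a rough feel of how such counts turn into a score, here is a sketch of a Robinson-style smoothed estimate.  The strength 0.45 and prior 0.5 are, as far as I know, the Spambayes defaults, but treat them as assumptions.  The training set sizes (100 ham, 100 spam) are invented, since the thread does not give them.

```python
def spam_prob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Smoothed spam probability for one token.

    spamcount/hamcount: messages of each kind containing the token
    (at least one of the two must be nonzero).
    nspam/nham: total trained messages of each kind.
    s, x: strength and value of the prior belief for unseen tokens.
    """
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    p = spamratio / (spamratio + hamratio)  # raw estimate from counts
    n = spamcount + hamcount                # how much evidence we have
    return (s * x + n * p) / (s + n)        # pull p toward x when n is small


spam_hapax = spam_prob(1, 0, nspam=100, nham=100)    # seen once, in a spam
ham_hapax = spam_prob(0, 1, nspam=100, nham=100)     # seen once, in a ham
three_to_one = spam_prob(3, 1, nspam=100, nham=100)  # hypothetical equal corpora
```

The two hapax cases round to 0.84 and 0.16, the same values that appear repeatedly in the evidence list above.  With equal-sized training sets, three spams against one ham comes out near 0.72; an unbalanced training set would shift that figure.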

