[spambayes-dev] default to mine_received_headers=True, "may be forged"

Skip Montanaro skip at pobox.com
Fri Dec 19 14:57:16 EST 2003


I've been running with mine_received_headers set to True for quite a while.
I fixed a couple of nits this morning in the regular expressions used to
pick out hostnames and IP addresses from Received: headers.  The hostname re
was frequently picking up IP addresses and chomping them from the wrong end.
I am pleased with how well it seems to work at this point(*).  A graph or
table of the 'received:.*' spamprob distribution shows that (for me, at
least) the bulk of the spamprobs are at or outside of the hapax points.
See:

    http://www.musi-cal.com/~skip/rcvd.png
    http://www.musi-cal.com/~skip/rcvd.txt

The graph plots the number of features with a given spamprob.  The two
impulses at the hapax points are 523 features (at 0.155...) and 1047
features (at 0.844...).  I cropped the graph so the smaller values would be
visible.

Obviously, this is still strongly hapax-driven (I have a small database at
the moment - 163 spam, 171 ham), but the data suggests that the hapax values
are pretty good indicators of the direction that feature will take when the
second instance is seen.
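
For anyone wondering where the two hapax values come from, they fall out of
the adjustment the classifier applies to raw counts.  What follows is only a
sketch, not the classifier's own code: the function name and signature are
made up for illustration, and the prior strength s=0.45 and prior
probability x=0.5 are assumptions (they do reproduce the 0.155... and
0.844... figures quoted in this message).

```python
def spamprob(spam_hits, ham_hits, nspam, nham, s=0.45, x=0.5):
    """Adjusted spam probability for one feature.

    spam_hits / ham_hits: number of spam / ham messages containing the feature.
    nspam / nham: total number of spam / ham messages trained.
    s, x: assumed prior strength and prior probability (illustrative defaults).
    """
    spam_ratio = spam_hits / nspam
    ham_ratio = ham_hits / nham
    # Raw estimate from the counts alone.
    p = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate toward the prior x; the pull weakens as n grows.
    n = spam_hits + ham_hits
    return (s * x + n * p) / (s + n)


# Database size taken from this message: 163 spam, 171 ham.
print(spamprob(1, 0, 163, 171))  # feature seen once, in spam only
print(spamprob(0, 1, 163, 171))  # feature seen once, in ham only
```

With these assumed constants a feature seen only in spam gets the same value
regardless of database size, because the raw estimate p is exactly 1.0 and
only the count n matters.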

While I was messing with the received header regular expressions today I
also noticed that Sendmail sometimes adds "may be forged" to a header.
Here's a bit from the sendmail docs in the context of an open relay
discussion:

    QAA02454: <ESCAPEFOUR at AOL.COM>... Relaying denied
    QAA02454: ruleset=check_rcpt, arg1=<ESCAPEFOUR at AOL.COM>,
            relay=some.domain [10.0.0.1] (may be forged),
        reject=550 <ESCAPEFOUR at AOL.COM>... Relaying denied
    QAA02454: from=<Anonymous at aol.com>, size=0, class=0, pri=0, nrcpts=0,
            proto=SMTP, relay=some.domain [10.0.0.1] (may be forged)

    Here the (may be forged) is the important part: it means that the DNS
    data for the host is inconsistent, and hence the name is not used for
    the relaying check but only the IP number.

This is also a very good spam indicator:

    % spamcounts -r 'may be forged'
    db: /Users/skip/.hammiedb
    token,nspam,nham,spam prob
    bi:received:may be forged received:mx,1,0,0.844827586207
    bi:received:may be forged received:biz,2,0,0.908163265306
    received:may be forged,5,0,0.95871559633
    bi:received:may be forged received:com,1,0,0.844827586207
    bi:received:127.0.0.1 received:may be forged,5,0,0.95871559633
    bi:received:may be forged received:il,1,0,0.844827586207

I generate the 'received:may be forged' token within the block controlled by
the mine_received_headers option.  A quick scan of my testing databases
shows the phrase is overwhelmingly associated with spam (it shows up in 221
out of 6843 spams, about 3.2%, and in only 30 out of 8395 hams, about
0.36%).
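
To make the idea concrete, here is a rough sketch of mining Received:
headers for this clue.  It is not the tokenizer's actual code: the function
name, both regular expressions and the sample message are assumptions made
up for illustration.  It pulls IP addresses out first and only then looks
for hostnames in what is left, which is one way to keep a hostname pattern
from chewing on pieces of an address.  It emits whole names and addresses
only, which is simpler than the tokens in the spamcounts output above.

```python
import re
from email import message_from_string

# Hypothetical patterns, for illustration only.
IP_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")
HOST_RE = re.compile(r"\b((?:[a-z0-9-]+\.)+[a-z]{2,})\b")


def received_tokens(msg_text, mine_received_headers=True):
    """Yield 'received:...' tokens from every Received: header."""
    if not mine_received_headers:
        return
    msg = message_from_string(msg_text)
    for header in msg.get_all("received", []):
        # Unfold continuation lines and normalize case.
        header = " ".join(header.split()).lower()
        if "may be forged" in header:
            yield "received:may be forged"
        for ip in IP_RE.findall(header):
            yield "received:" + ip
        # Blank out the addresses so the hostname pattern cannot match
        # fragments of them.
        remainder = IP_RE.sub(" ", header)
        for host in HOST_RE.findall(remainder):
            yield "received:" + host


# Made-up sample message modeled on the sendmail log excerpt above.
SAMPLE = (
    "Received: from some.domain (some.domain [10.0.0.1] (may be forged))\n"
    "\tby mx.example.com with ESMTP id QAA02454;\n"
    "\tFri, 19 Dec 2003 14:57:16 -0500\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

for token in received_tokens(SAMPLE):
    print(token)
```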

I'm inclined to trust sendmail on this one and just add it.  It seems like a
very objective feature.  In fact, if other mail transport agents provide
similar clues about forged addresses, I think we should look for their clues
and lump them all into one 'received:may be forged' feature.

Skip

(*) Here's a quick summary of my latest setup.  I'm running from CVS
(natch).  I pushed my cutoffs out to 0.05 and 0.95 and run with bigrams
enabled.  I train on all mistakes and unsures.  I also have it automatically
train on a random 10% of the messages which score as ham or spam.  I tried
training on everything, but the database was growing way too quickly.  The
extreme cutoffs minimize the chance of an fp or fn, either of which would
mean that to untrain I'd have to go find the message and move it from one
pile to the other.  So far, no fp's, a few fn's and fewer unsures than I
anticipated.
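
The training regime in this footnote can be sketched roughly as below.  The
names are invented for illustration and none of this is lifted from the
project's code.  How a score exactly on a cutoff is treated is an
assumption, as is the idea that a "mistake" is known because the caller
supplies the true label.

```python
import random

HAM_CUTOFF = 0.05    # scores below this count as ham
SPAM_CUTOFF = 0.95   # scores at or above this count as spam
SAMPLE_RATE = 0.10   # fraction of confidently scored mail to train on


def classify(score):
    """Map a score in [0, 1] to 'ham', 'spam' or 'unsure'."""
    if score < HAM_CUTOFF:
        return "ham"
    if score >= SPAM_CUTOFF:
        return "spam"
    return "unsure"


def should_train(score, true_label=None, rng=random):
    """Decide whether a scored message should be trained on.

    true_label is 'ham' or 'spam' when a human has corrected the message,
    otherwise None.
    """
    verdict = classify(score)
    if verdict == "unsure":
        return True                      # always train on unsures
    if true_label is not None and verdict != true_label:
        return True                      # always train on mistakes
    return rng.random() < SAMPLE_RATE    # random 10% of everything else


print(classify(0.5), should_train(0.5))
print(classify(0.99), should_train(0.99, true_label="ham"))
```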


