[spambayes-dev] default to mine_received_headers=True,
"may be forged"
tim.one at comcast.net
Sat Dec 20 23:27:22 EST 2003
> I've been running with mine_received_headers set to True for quite
> a while. I fixed a couple of nits this morning with the regular
> expressions used to pick out hostnames and IP addresses from
> Received: headers. The hostname regex was frequently picking up IP
> addresses and chomping them from the wrong end. I am pleased with
> how well it seems to work at this point (*). Looking at a graph or
> table of the 'received:.*' spamprob distribution shows that (for me,
> at least) the bulk of the spamprobs are at or outside of the hapax
> points. See:
> The graph plots the number of features with a given spamprob. The two
> impulses at the hapax points are 523 (0.155...) and 1047 (0.844...).
> I cropped the graph so the smaller values would be visible.
> Obviously, this is still strongly hapax-driven (I have a small
> database at the moment - 163 spam, 171 ham), but the data suggests
> that the hapax values are pretty good indicators of the direction
> that feature will take when the second instance is seen.
Cool! Thanks for the good work. I'll give this a try too.
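The hapax points quoted above (0.155... and 0.844...) fall out of the Robinson-style spamprob adjustment spambayes applies per token, using the default unknown-word strength s=0.45 and prior x=0.5. A quick sketch, reducing that computation to one illustrative function (this mirrors the per-token formula, it is not the actual classifier code):

```python
def hapax_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Robinson-adjusted spamprob for a single token.

    s and x are spambayes' default unknown-word strength and prior;
    nspam/nham are the total trained message counts.
    """
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    prob = spamratio / (hamratio + spamratio)  # raw per-token evidence
    n = hamcount + spamcount                   # times the token was seen
    return (s * x + n * prob) / (s + n)        # shrink toward the prior x

# With 163 spam / 171 ham trained, as in the message above:
print(hapax_spamprob(1, 0, 163, 171))  # spam hapax: ~0.84483
print(hapax_spamprob(0, 1, 163, 171))  # ham hapax:  ~0.15517
```

Note that for a hapax the raw prob is exactly 1.0 or 0.0, so the adjusted value depends only on s and x, which is why every hapax lands on one of those two impulses.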
> While I was messing with the received header regular expressions
> today I also noticed that Sendmail sometimes adds "may be forged" to
> a header. Here's a bit from the sendmail docs in the context of an
> open relay discussion:
> QAA02454: <ESCAPEFOUR at AOL.COM>... Relaying denied
> QAA02454: ruleset=check_rcpt, arg1=<ESCAPEFOUR at AOL.COM>,
> relay=some.domain [10.0.0.1] (may be forged),
> reject=550 <ESCAPEFOUR at AOL.COM>... Relaying denied
> QAA02454: from=<Anonymous at aol.com>, size=0, class=0, pri=0,
> nrcpts=0, proto=SMTP, relay=some.domain [10.0.0.1] (may
> be forged)
> Here the (may be forged) is the important part: it means that the
> DNS data for the host is inconsistent, and hence the name is not
> used for the relaying check but only the IP number.
> This is also a very good spam indicator:
> % spamcounts -r 'may be forged'
> db: /Users/skip/.hammiedb
> token,nspam,nham,spam prob
> bi:received:may be forged received:mx,1,0,0.844827586207
> bi:received:may be forged received:biz,2,0,0.908163265306
> received:may be forged,5,0,0.95871559633
> bi:received:may be forged received:com,1,0,0.844827586207
> bi:received:127.0.0.1 received:may be forged,5,0,0.95871559633
> bi:received:may be forged received:il,1,0,0.844827586207
> I generate it within the block controlled by the mine_received_headers
> option. A quick scan of my testing databases shows this is
> overwhelmingly associated with spam (shows up in 221 out of 6843
> spams and only 30 out of 8395 ham).
> I'm inclined to trust sendmail on this one and just add it. It seems
> like a very objective feature.
I agree -- it's extremely unlikely to lose. The ones to worry about are
things spammers could inject to push things in the ham direction, but
they're not gonna get far forging "may be forged" unless I have a *very*
weird idea of ham <wink>.
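For illustration, picking that marker out of a Received: header could look something like the sketch below (a standalone regex check in the spirit of the mine_received_headers code, not the actual spambayes tokenizer):

```python
import re

# Sendmail appends "(may be forged)" when the relay's forward and
# reverse DNS are inconsistent, so a literal match is all we need.
MAY_BE_FORGED = re.compile(r'\(may be forged\)', re.IGNORECASE)

def received_tokens(received_header):
    """Yield an illustrative token if the forgery marker is present."""
    if MAY_BE_FORGED.search(received_header):
        yield 'received:may be forged'

hdr = ('from some.domain ([10.0.0.1]) (may be forged) '
       'by mail.example.com with SMTP')
print(list(received_tokens(hdr)))  # ['received:may be forged']
```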
> In fact, if other mail transport agents provide similar clues about
> forged addresses, I think we should look for their clues and lump them
> all into one 'received:may be forged' feature.
I noticed this in the headers of a spam today:
Received: from shawmail-cg-shawcable-net (untrusted sender)
by rwcrmxc11.comcast.net (rwcrmxc11) with SMTP
id <20031220054919r1100n4pj1e>; Sat, 20 Dec 2003 05:49:20 +0000
It's the "(untrusted sender)" part that's interesting. I'd suggest *not*
folding that in with "may be forged", though. There probably aren't a lot
of strings of this nature, so the database burden should be trivial, and I
*bet* different strings will prove to have different spamprobs.
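Keeping each MTA's annotation as its own token, rather than one lumped feature, might look like this sketch ("may be forged" comes from the sendmail docs above and "untrusted sender" from the comcast header; the helper itself is hypothetical):

```python
import re

# Parenthesized annotations some MTAs add to Received: headers.
ANNOTATIONS = ('may be forged', 'untrusted sender')

def annotation_tokens(received_header):
    """Emit one distinct token per recognized annotation, so each
    string can earn its own spamprob instead of sharing one feature."""
    for ann in ANNOTATIONS:
        if re.search(r'\(%s\)' % re.escape(ann), received_header):
            yield 'received:' + ann

hdr = ('from shawmail-cg-shawcable-net (untrusted sender) '
       'by rwcrmxc11.comcast.net')
print(list(annotation_tokens(hdr)))  # ['received:untrusted sender']
```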
> (*) Here's a quick summary of my latest setup. I'm running from CVS
> (natch). I pushed my cutoffs out to 0.05 and 0.95 and run with
> bigrams enabled. I train on all mistakes and unsures. I also have
> it automatically train on a random 10% of the messages that score
> as ham or spam. I tried training on everything, but the database was
> growing way too quickly. The extreme cutoffs minimize the chance of
> a fp or fn which would mean to untrain I have to go find the message
> and move it from one pile to the other. So far, no fp's, a few fn's
> and fewer unsures than I anticipated.
I'm running 0.04 and 0.95 with bigrams now, sticking to just
mistake-and-unsure training, after seeding with 50 of each, although the
seeds were the most recent trained on from my mistake-and-unsure-trained
unigram classifier. Am at about 145 of each now. I don't trust it yet --
it's still surprising too often. I had disappointing results with a purely
mistake/unsure-trained unigram classifier before; the bigram one isn't
disappointing so far, it just leaves me cautious after a few days. I expect
(without proof) that *some* random component is very helpful, at least to
get the thing started.
It's still 89% hapax. I had expected that percentage to drop by now, but
without a random component I'm not sure that was a reasonable expectation:
spam+ham  count      %   cumm %
       1  63611  88.85    88.85
       2   4126   5.76    94.61
       3   1377   1.92    96.54
       4    680   0.95    97.49
       5    397   0.55    98.04
       6    255   0.36    98.40
       7    178   0.25    98.65
       8    134   0.19    98.83
       9    109   0.15    98.98
      10     70   0.10    99.08
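The table above is just a cumulative histogram over token occurrence counts; given a mapping of token -> (spam + ham) count, something like this sketch reproduces it (the function name and demo data are made up for illustration):

```python
from collections import Counter

def occurrence_table(token_counts, limit=10):
    """token_counts maps token -> spam+ham occurrence count.
    Returns rows of (count, ntokens, pct, cumulative pct)."""
    hist = Counter(token_counts.values())  # occurrence count -> #tokens
    total = sum(hist.values())
    rows, cumm = [], 0.0
    for count in sorted(hist):
        if count > limit:
            break
        pct = 100.0 * hist[count] / total
        cumm += pct
        rows.append((count, hist[count], pct, cumm))
    return rows

# Tiny made-up database: 6 hapaxes, 2 tokens seen twice, 1 seen thrice.
demo = {'t%d' % i: c for i, c in enumerate([1] * 6 + [2] * 2 + [3])}
for count, n, pct, cumm in occurrence_table(demo):
    print('%8d %6d %6.2f %8.2f' % (count, n, pct, cumm))
```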