[spambayes-dev] default to mine_received_headers=True, "may be forged"

Mon Dec 22 12:10:13 EST 2003

    >> While I was messing with the received header regular expressions
    >> today I also noticed that Sendmail sometimes adds "may be forged" to
    >> a header....

    >> I'm inclined to trust sendmail on this one and just add it.  It seems
    >> like a very objective feature.

    Tim> I agree -- it's extremely unlikely to lose.  The ones to worry
    Tim> about are things spammers could inject to push things in the ham
    Tim> direction, but they're not gonna get far forging "may be forged"
    Tim> unless I have a *very* weird idea of ham <wink>.

I just checked in tokenizer.py with this change.  Note that the new token is
only generated when options["Tokenizer", "mine_received_headers"] is enabled.
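
For anyone who wants to see the shape of the idea without reading the
checkin, here is a minimal sketch.  It is not the code in tokenizer.py: the
function name, the token text and the plain dict standing in for the options
object are all made up for illustration.  Only the option lookup and the
"may be forged" string come from this thread.

```python
import email
import re

# Stand-in for the real options object (an assumption); only the lookup
# style options["Tokenizer", "mine_received_headers"] mirrors the post.
options = {("Tokenizer", "mine_received_headers"): True}

# Hypothetical pattern for the string Sendmail adds to a Received header.
MAY_BE_FORGED_RE = re.compile(r"may be forged", re.IGNORECASE)


def received_tokens(msg):
    """Yield one token per Received header that carries the marker."""
    if not options["Tokenizer", "mine_received_headers"]:
        return  # option is off, so generate nothing
    for header in msg.get_all("received", []):
        # Collapse folded continuation lines into a single line first.
        header = " ".join(header.split())
        if MAY_BE_FORGED_RE.search(header):
            # Hypothetical token name, chosen for this sketch only.
            yield "received:may be forged"
```

Something like list(received_tokens(email.message_from_string(raw_text)))
would then give the extra tokens for one message, where raw_text is whatever
message source you have on hand.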

Skip

    Tim> I noticed this in the headers of a spam today:

    Tim> Received: from shawmail-cg-shawcable-net
    Tim>        (c-24-9-163-244.client.comcast.net[24.9.163.244](untrusted sender))
    Tim>        by rwcrmxc11.comcast.net (rwcrmxc11) with SMTP
    Tim>        id <20031220054919r1100n4pj1e>; Sat, 20 Dec 2003 05:49:20 +0000

    Tim> It's the "(untrusted sender)" part that's interesting.  I'd suggest
    Tim> *not* folding that in with "may be forged", though.  There probably
    Tim> aren't a lot of strings of this nature, so the database burden
    Tim> should be trivial, and I *bet* different strings will prove to have
    Tim> different spamprobs.

You're probably right.  In this case it may just be that an ident lookup
failed (many servers don't run identd), so the assertion that the message is
spam would be much weaker.

Poking around Google a bit suggests "(untrusted sender)" is something
specific to Comcast.  I'm happy to add it if you would like, but in the mail
I've saved it actually turns up a bit more in ham (six messages) than in
spam (one message), and not at all in my current training database.
All such lines also match "client2?\.attbi\.com".
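
If it helps to try this on your own mail, here is a rough sketch of keeping
the two strings as separate tokens and tallying them.  The function names and
the "received:" token prefix are invented for the example and are not
what tokenizer.py does.  The marker strings and the attbi pattern are the
ones quoted in this thread.

```python
import re

# Marker strings mentioned in this thread.  Each keeps its own token so a
# separate spamprob can be learned for each one.
MARKERS = ("may be forged", "untrusted sender")

# Pattern quoted above for hosts seen alongside "(untrusted sender)".
ATTBI_RE = re.compile(r"client2?\.attbi\.com")


def marker_tokens(received_header):
    """Return one hypothetical token per marker found in a Received header."""
    # Lowercase and collapse folded lines so markers match across line breaks.
    text = " ".join(received_header.lower().split())
    return ["received:" + marker for marker in MARKERS if marker in text]


def count_markers(headers):
    """Tally marker tokens over an iterable of Received header strings."""
    counts = {}
    for header in headers:
        for token in marker_tokens(header):
            counts[token] = counts.get(token, 0) + 1
    return counts
```

Running count_markers once over the Received headers from saved ham and once
over those from saved spam gives the kind of comparison described above, and
ATTBI_RE.search(header) checks whether a given line matches the attbi
pattern.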

Skip