[spambayes-dev] default to mine_received_headers=True, "may be forged"

Mon Dec 22 14:20:14 EST 2003

[Skip Montanaro]
> ...
> Poking around Google a bit suggests "(untrusted sender)" is something
> specific to Comcast.  I'm happy to add it if you would like, but in
> the mail I've saved it actually seems to turn up a bit more in ham
> (six messages) than in spam (one message) and not at all in my
> current training database. All such lines also match
> "client2?\.attbi\.com".

It really doesn't matter whether it looks hammy or spammy to you -- each
person's classifier learns "what works" for that person's email mix.  IOW,
I'm not looking for "spam clues" here, I'm looking for potentially
interesting raw data to throw at the classifier, be that hammy or spammy or
neutral.  It's the classifier's job to *learn* what's useful, but it can
only see what we explicitly show it.

A generalization of this gimmick finds several potentially interesting
Received comments in my current little training database:

'received:(built aug  5\n 2002)' spam: 0 ham: 1
'received:(built aug  5 2002)' spam: 0 ham: 1
'received:(built mar 18 2003)' spam: 0 ham: 2
'received:(built may\n 14 2003)' spam: 0 ham: 1
'received:(built may  7 2001)' spam: 0 ham: 1
'received:(built may 13 2002)' spam: 0 ham: 3
'received:(built may 14 2003)' spam: 0 ham: 6
'received:(built nov\n 25 2002)' spam: 0 ham: 2
'received:(built nov  6 2002)' spam: 0 ham: 2
'received:(built nov 25 2002)' spam: 0 ham: 3
'received:(built nov 6\n 2002)' spam: 0 ham: 2
'received:(built sep 23\n 2002)' spam: 0 ham: 1
'received:(built sep 23 2002)' spam: 0 ham: 2
'received:(helo bala)' spam: 0 ham: 1
'received:(helo cyb)' spam: 0 ham: 1
'received:(helo gamer)' spam: 0 ham: 1
'received:(helo hp751n)' spam: 0 ham: 1
'received:(helo mailscanner)' spam: 0 ham: 1
'received:(may\n\tbe forged)' spam: 0 ham: 1
'received:(no client certificate requested)' spam: 0 ham: 3
'received:(qmail 20043 invoked from network)' spam: 0 ham: 1
'received:(qmail 20649 invoked from network)' spam: 0 ham: 1
'received:(qmail 20705 invoked from network)' spam: 0 ham: 1
'received:(qmail 29420 invoked from network)' spam: 0 ham: 1
'received:(qmail 30856 invoked from network)' spam: 0 ham: 1
'received:(qmail 59242 invoked by uid 1002)' spam: 0 ham: 1
'received:(qmail 6276 invoked by uid 99)' spam: 0 ham: 1
'received:(qmail 6378 invoked from network)' spam: 0 ham: 1
'received:(qmail 6383 invoked from network)' spam: 0 ham: 1
'received:(qmail 76214 invoked by uid 0)' spam: 0 ham: 1
'received:(qmail 94959 invoked by uid 399)' spam: 0 ham: 1
'received:(built feb 13 2003)' spam: 1 ham: 1
'received:(helo 3sfm)' spam: 1 ham: 0
'received:(helo d1e)' spam: 1 ham: 0
'received:(helo lsi)' spam: 1 ham: 0
'received:(helo s9rr4v)' spam: 1 ham: 0
'received:(helo timslaptop)' spam: 1 ham: 0
'received:(helo xtr)' spam: 1 ham: 0
'received:(qmail 13979 invoked from network)' spam: 1 ham: 0
'received:(qmail 5950 invoked by uid 500)' spam: 1 ham: 0
'received:(sasktel mail service)' spam: 1 ham: 0
'received:(smtp server)' spam: 2 ham: 1
'received:(misconfigured sender)' spam: 12 ham: 5
'received:(may be forged)' spam: 3 ham: 1
'received:(untrusted sender)' spam: 9 ham: 3

Note that one of the "may be forged" comments there was split across lines
('(may\n\tbe forged)').

That was done by adding

received_complaints_re = re.compile(r'\(\w+(?:\s+\w+)+\)')

and replacing

               if header.lower().find('may be forged') != -1:
                   yield 'received:may be forged'

with

               for x in received_complaints_re.findall(header.lower()):
                   yield 'received:' + x
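
To make it concrete what that pattern does and doesn't pick up, here is a small standalone sketch. Only the regular expression and the 'received:' prefix come from the change above; the helper name received_comment_tokens and the sample header strings are made up for illustration and are not the actual tokenizer code.

```python
import re

# Same pattern as in the change above: an open paren, one word, then one or
# more whitespace-separated further words, then a close paren.
received_complaints_re = re.compile(r'\(\w+(?:\s+\w+)+\)')

def received_comment_tokens(header):
    """Yield a token per multi-word parenthesized comment (hypothetical helper)."""
    for x in received_complaints_re.findall(header.lower()):
        yield 'received:' + x

# Made-up Received header values, not taken from real mail.
plain = "from foo.example.com (may be forged) by mx.example.net (Postfix) with SMTP"
folded = "from foo.example.com (may\n\tbe forged) by mx.example.net"

print(list(received_comment_tokens(plain)))
print(list(received_comment_tokens(folded)))
```

Single-word comments such as "(Postfix)" are skipped because the pattern requires at least two words, and a comment containing punctuation (a dotted hostname, say) doesn't match either. Whitespace inside a match is kept as-is, which is why a comment folded across lines produces a different token than the same comment on one line.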

Since these feed into bigrams too, there are a lot more combinations.  Some
are purely spammy so far:

'bi:received:(untrusted sender) received:ca' spam: 3 ham: 0
'bi:received:63.240.213.250 received:(may be forged)' spam: 3 ham: 0

and some are purely hammy so far:

'bi:received:(built may 14 2003) received:172' spam: 0 ham: 5
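
For anyone wondering how the 'bi:' entries arise, here is a minimal sketch. It assumes a bigram is just each token paired with the token that follows it, joined by a space and prefixed with 'bi:', which is what the entries above look like. It is not the real bigram code; the function name with_bigrams and the sample token stream are hypothetical.

```python
def with_bigrams(tokens):
    """Yield each token, plus a 'bi:' token for each adjacent pair (illustrative only)."""
    prev = None
    for tok in tokens:
        yield tok
        if prev is not None:
            # Pair the previous token with the current one.
            yield 'bi:%s %s' % (prev, tok)
        prev = tok

# Made-up token stream, assembled from token shapes seen in the listings above.
stream = ['received:63.240.213.250', 'received:(may be forged)', 'received:ca']

for tok in with_bigrams(stream):
    print(repr(tok))
```

Every new Received comment token can pair with whatever token comes before or after it, so the number of distinct bigrams grows much faster than the number of distinct comments.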