[spambayes-dev] default to mine_received_headers=True, "may be forged"
Richie Hindle
richie at entrian.com
Mon Dec 22 18:57:27 EST 2003
> If anyone else would like to generate some raw data
Your script didn't define 'pat' - I've assumed you meant:
pat = re.compile(r'\(\w+(?:\s+\w+)+\)')
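For anyone wanting to reproduce this, here is a minimal sketch of how a tally like the ones below could be produced. It is not the original script: the function name `tally_received_comments`, the use of the standard `email` module, and the `min_count=3` cutoff (guessed from the fact that the lists start at 3) are all assumptions.

```python
import re
from collections import Counter
from email import message_from_string

# The pattern assumed above: a parenthesised run of two or more words.
pat = re.compile(r'\(\w+(?:\s+\w+)+\)')

def tally_received_comments(messages, min_count=3):
    """Count parenthesised comments found in Received headers.

    `messages` is an iterable of raw message texts. Returns a sorted
    list of (count, comment) tuples, keeping only comments seen at
    least `min_count` times (the cutoff is an assumption).
    """
    counts = Counter()
    for text in messages:
        msg = message_from_string(text)
        # get_all returns every Received header, or [] if there are none.
        for header in msg.get_all('Received', []):
            counts.update(pat.findall(header))
    return sorted((n, comment) for comment, n in counts.items()
                  if n >= min_count)
```

Note that this counts header occurrences rather than messages, so one message with several matching Received lines contributes more than once.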
Here's what I get from my corpus of 20,000 verified spams:
[(3, '(HELO 0j3x2or)'),
(3, '(HELO 2vqmm)'),
(3, '(HELO 3bn0dn2)'),
(3, '(HELO 3frty7)'),
(3, '(HELO 6qzmi3)'),
(3, '(HELO QRJATYDI)'),
(3, '(HELO ben)'),
(3, '(HELO d9vyix)'),
(3, '(HELO ic6nlfq)'),
(3, '(HELO laabud)'),
(3, '(HELO ojeudcb)'),
(3, '(HELO pebbyrl)'),
(3, '(HELO pm9he0)'),
(3, '(HELO r26)'),
(3, '(HELO richie)'),
(3, '(HELO vzjqt6x)'),
(3, '(HELO xhz5j)'),
(3, '(HELO yu5s)'),
(3, '(untrusted sender)'),
(4, '(built Aug 19 2002)'),
(4, '(built May 7 2001)'),
(6, '(HELO kos)'),
(6, '(built Jul 28 2003)'),
(6, '(built Oct 18 2002)'),
(7, '(built Feb 21 2002)'),
(8, '(HELO localhost)'),
(9, '(built Sep 8 2003)'),
(11, '(HELO pm69)'),
(12, '(built Feb 13 2003)'),
(15, '(HELO pm65)'),
(18, '(built Mar 18 2003)'),
(21, '(built May 14 2003)'),
(27, '(SMTP Server)'),
(149, '(may be forged)')]
And these from the 12,000 or so messages in the spambayes and spambayes-dev
archives - not 100% spam-free, but very very nearly:
[(3, '(HELO GR43)'),
(3, '(HELO WPWD0038)'),
(3, '(HELO diffy2)'),
(3, '(HELO gamer)'),
(3, '(built Jul 12 2002)'),
(4, '(HELO jimws)'),
(4, '(HELO localhost)'),
(6, '(HELO dj2klap)'),
(6, '(built Feb 21 2002)'),
(6, '(built Sep 8 2003)'),
(7, '(userid 1)'),
(8, '(EHLO localhost)'),
(8, '(MET DST)'),
(8, '(No client certificate requested)'),
(8, '(SquirrelMail authenticated user gaza)'),
(8, '(built Jul 28 2003)'),
(9, '(0 bits)'),
(11, '(HELO STRIPER)'),
(11, '(built Jan 23 2003)'),
(11, '(built Oct 18 2002)'),
(11, '(sSMTP sendmail emulation)'),
(13, '(HELO jim)'),
(13, '(SMTP Server)'),
(16, '(built Nov 6 2002)'),
(21, '(built Nov 25 2002)'),
(26, '(HELO striper)'),
(27, '(built Jul 29 2002)'),
(28, '(built Jan 7 2003)'),
(33, '(misconfigured sender)'),
(34, '(userid 4)'),
(35, '(HELO lion)'),
(51, '(may be forged)'),
(59, '(built Feb 13 2003)'),
(86, '(built May 14 2003)'),
(99, '(built Sep 23 2002)'),
(100, '(built Mar 18 2003)'),
(101, '(built May 13 2002)'),
(158, '(untrusted sender)'),
(364, '(built Aug 5 2002)')]
So "(may be forged)" would be a weak spam clue for me, while "(untrusted
sender)" would be a strong ham clue - but 133 of those 158 are from Tim...
Even taking Tim out of the equation, it's 25-to-3 in favour of ham. The
other 25 are from maybe a dozen other people. Ah - all are either
attbi.com or comcast.net. Here's an example of an attbi.com one:
Received: from hal2
    (h00e01840da57.ne.client2.attbi.com[24.91.108.212](untrusted sender))
    by attbi.com (rwcrmhc11) with SMTP
    id <2003061814314101300an5bve>; Wed, 18 Jun 2003 14:31:42 +0000
Message-ID: <ENELLFEIPIANCGOIGFOEOEMIDPAA.RCaro at CMC.us>
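As a quick check, here is the assumed pattern run against the Received header above. This is only a sketch: the header text is pasted in by hand, and the folding whitespace is my guess at the original layout.

```python
import re

# Same pattern as assumed earlier in this message.
pat = re.compile(r'\(\w+(?:\s+\w+)+\)')

# The attbi.com Received header quoted above (field name omitted).
header = (
    "from hal2\n"
    " (h00e01840da57.ne.client2.attbi.com[24.91.108.212](untrusted sender))\n"
    " by attbi.com (rwcrmhc11) with SMTP\n"
    " id <2003061814314101300an5bve>; Wed, 18 Jun 2003 14:31:42 +0000"
)

matches = pat.findall(header)
print(matches)  # ['(untrusted sender)']
```

The pattern needs at least two whitespace-separated words inside the parentheses, so the one-word "(rwcrmhc11)" and the dotted hostname comment are skipped and only "(untrusted sender)" is counted.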
Make of all that what you will.
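Since the two corpora differ in size, it may help to put the counts above on a per-message footing. A rough sketch follows, assuming exactly 12,000 ham messages (the real figure is only "12,000 or so") and treating each hit as one message, which the tallies don't actually guarantee. The `rates` helper is just for illustration.

```python
SPAM_TOTAL = 20000   # verified spams
HAM_TOTAL = 12000    # approximate: "12,000 or so" list messages

# (hits in spam corpus, hits in ham corpus), taken from the lists above.
counts = {
    "(may be forged)": (149, 51),
    "(untrusted sender)": (3, 158),
    "(untrusted sender), Tim excluded": (3, 158 - 133),
}

def rates(spam_hits, ham_hits):
    """Return (spam rate, ham rate) as fractions of each corpus."""
    return spam_hits / SPAM_TOTAL, ham_hits / HAM_TOTAL

for label, (spam_hits, ham_hits) in counts.items():
    spam_rate, ham_rate = rates(spam_hits, ham_hits)
    print(f"{label}: spam {spam_rate:.3%}, ham {ham_rate:.3%}")
```

On those assumptions "(may be forged)" turns up in roughly 0.75% of spam against 0.43% of ham, which is why I'd call it a weak clue at best.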
--
Richie Hindle
richie at entrian.com