[spambayes-dev] more selective Received: header mining...

Skip Montanaro skip at pobox.com
Sat Nov 22 01:17:26 EST 2003


I made a change to the mine_received_headers stuff this evening, adding a
new option, gateway_machines.  The idea is that the only Received: header
which is really useful is the one which crosses the boundary between your
known "good" network and the wild free-for-all part of the net.  Received:
headers from hosts internal to your network are meaningless, since for the
most part, all mail passes through them, while Received: headers from hosts
external to your network probably just contain random garbage which clogs
your database with meaningless tokens.

On the other hand, information from the point at which your mail system
receives a message can be useful.  You can trust your network's mail server
to at least record the delivering host's IP address correctly.  When processing
Received: headers, I use the gateway_machines option (a regular expression)
to detect when I first encounter an SMTP server I trust.  I have four useful
email addresses: skip at mojam.com, skip at pobox.com, skip at python.org and
montanaro at users.sourceforge.net, so I set gateway_machines to

    mojam\.com|pobox\.com|python\.org|sourceforge\.net

The attached context diff implements the change.  If you leave
gateway_machines an empty string, mine_received_headers will have its
original meaning.  If you set it to something, it will cause only the
earliest Received: header which matches your regular expression to be
processed.
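
A minimal sketch of that selection rule, for illustration only.  It is not
the code in the attached diff.  The function name select_gateway_header, the
BY_HOST pattern, and the choice to match the regex against the host named in
the "by" clause are all assumptions of mine; the headers are assumed to be in
message order (newest first).

```python
import re

# Example value copied from this message; any regex would do.
GATEWAY_MACHINES = r"mojam\.com|pobox\.com|python\.org|sourceforge\.net"

# Assumption: the receiving host is the token following "by" in the header.
BY_HOST = re.compile(r"\bby\s+(\S+)", re.IGNORECASE)


def select_gateway_header(received_headers, gateway_machines):
    """Return the one Received: header to mine, or None.

    received_headers is in message order (newest first), so the
    chronologically earliest header is the last one in the list.
    """
    if not gateway_machines:
        # Empty option: the caller keeps the original behavior.
        return None
    gateway_re = re.compile(gateway_machines, re.IGNORECASE)
    # Walk from the oldest header toward the newest and stop at the
    # first one written by a host we trust.
    for header in reversed(received_headers):
        match = BY_HOST.search(header)
        if match and gateway_re.search(match.group(1)):
            return header
    return None


headers = [
    "from mail.pobox.com (mail.pobox.com [10.0.0.1]) by mail.mojam.com with ESMTP",
    "from unknown (HELO x) (192.0.2.7) by mail.pobox.com with SMTP",
    "from fake.example by relay.example.net with SMTP",
]
chosen = select_gateway_header(headers, GATEWAY_MACHINES)
```

Here chosen is the second header, the one where a trusted host accepts mail
from outside.  Note that the regex is unanchored, so a host such as
notpobox.com would also match; anchor the pattern if that matters to you.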

It's hard to tell how well this will work, since improvements are
necessarily very small at this stage of the game.  Any benefit also seems
likely to be time-sensitive.  Machines which were open relays a year ago may
be closed off now, forcing spammers to use different routes to your mailbox.
I'm thinking it might be more helpful with small training databases and
small messages, as it adds more relevant clues for the classifier to munch
on.

My only testing to this point has been to see how it does on my current
unsure mailbox.  At the moment it contains about 50 messages, a mixture of
ham and spam (though mostly spam) which all scored unsure when they landed
there and which for one reason or another I have yet to delete or save
somewhere else.  Before enabling gateway_machines, no messages scored as
spam.  After setting it to the above regex and retraining from scratch
(~170 hams and 250 spams), three messages from my unsure mailbox scored as
spam.

Not surprisingly, the number of 'received:' records in my training database
dropped substantially (from 2289 to 1254) after enabling this.

Finally, note that the couple of context diffs here were pulled out of
already modified versions of tokenizer.py and Options.py, so patch will
probably apply them with offsets.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb.diffs
Type: application/octet-stream
Size: 5414 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031122/b4485c26/sb.obj

