[spambayes-dev] more selective Received: header mining...

Tue Nov 25 21:36:20 EST 2003

[Skip Montanaro]
> I made a change to the mine_received_headers stuff this evening,
> adding a new option, gateway_machines.  The idea is that the only
> Received: header which is really useful is the one which crosses the
> boundary between your known "good" network and the wild free-for-all
> part of the net.  Received: headers from hosts internal to your
> network are meaningless, since for the most part, all mail passes
> through them, while Received: headers from hosts external to your
> network probably just contain random garbage which clogs your
> database with meaningless tokens.

I don't know that that's so.  On the spam side, some spammers forge a
sequence of Received headers to make it appear as if the path to your
machine was legitimate, and the specific paths they forge can be clues.  On
the ham side, different senders' emails often take different paths that
leave behind distinctive clues on their end of the pipe.

If a token in the database is indeed worthless, that can be detected in two
ways: (1) the token is never used for scoring anymore; and/or (2) the token
has a spamprob in the range we ignore.  If your real concern is purging
useless tokens, then analysis based on #1 and #2 should identify huge masses
of useless tokens, including all the useless ones due to Received headers.
#1 is hard to do now, of course (since we don't save any token access-time
info in the database).
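
A rough sketch of what a purge pass based on #1 and #2 might look like.
This is not the spambayes code: TokenInfo, its last_used field, and
find_useless_tokens are all made up for illustration, and last_used is
exactly the access-time info we don't save today.  The 0.1 cutoff around
0.5 is likewise just an assumed value for "the range we ignore".

```python
from typing import Dict, List, NamedTuple, Optional


class TokenInfo(NamedTuple):
    """Hypothetical per-token record; not the real database layout."""
    spamprob: float
    # Assumed field: time the token was last used in scoring (None if never).
    last_used: Optional[float]


def find_useless_tokens(db: Dict[str, TokenInfo], now: float,
                        max_idle: float,
                        min_strength: float = 0.1) -> List[str]:
    """Return tokens that are stale (#1) or too bland to be scored (#2)."""
    useless = []
    for token, info in db.items():
        # 1: not used for scoring within the last max_idle seconds.
        stale = info.last_used is None or now - info.last_used > max_idle
        # 2: spamprob too close to 0.5 to ever be picked as a clue.
        bland = abs(info.spamprob - 0.5) < min_strength
        if stale or bland:
            useless.append(token)
    return sorted(useless)


# Made-up sample data, with times in seconds.
sample_db = {
    "received:a": TokenInfo(spamprob=0.99, last_used=100.0),  # stale
    "received:b": TokenInfo(spamprob=0.52, last_used=990.0),  # bland
    "viagra": TokenInfo(spamprob=0.98, last_used=995.0),      # still useful
}
purge = find_useless_tokens(sample_db, now=1000.0, max_idle=500.0)
```

Note that nothing here looks at where a token came from.  A Received
token that is stale or bland gets flagged the same way as any other, and
one that is still pulling its weight is left alone.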

BTW, the Outlook addin currently leaves mine_received_headers at its default
False, so I don't have any tokens due to Received lines in my databases.