[Spambayes] better Received header tokens

Skip Montanaro skip at pobox.com
Mon Mar 10 11:47:20 EST 2003


    Tim> It was an example of harmful correlation, by way of illustrating
    Tim> why a strong indicator isn't necessarily a desirable indicator.
    Tim> This particular example applies pretty directly to any source from
    Tim> which a user rarely (but not never) gets spam, and leaves clues
    Tim> about itself.

True enough.  I'm sure there are lots of such correlations.  But if a
person's incoming mail isn't dominated by one source, such harmful
correlations will have less impact on the final score of any given message,
right?  As an example, I just grep'd my ham collection for the Sender field,
squashed case, sorted and uniq'd, then sorted again.  The tail end looked
like

     150 sender: folkmusic-admin at grassyhill.org
     221 sender: zope-admin at zope.org
     255 sender: folk music presenters <folkvenu at lists.psu.edu>
     450 sender: spambayes-bounces at python.org
     550 sender: python-checkins-admin at python.org
     555 sender: owner-6pack at autox.team.net
     688 sender: python-dev-admin at python.org
     821 sender: spamassassin-talk-admin at lists.sourceforge.net
    1387 sender: cedu-admin at manatee.mojam.com
    3091 sender: python-list-admin at python.org

This is out of 9609 Sender headers (just under 12,000 hams).  If I remember
comments you've made on this topic in the past, I expect your Sender:
headers to be more strongly dominated by Python-related messages than this.
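
For anyone who would rather not chain grep, sort and uniq, the tally above can be approximated in Python.  This is only a sketch: tally_senders and tail are names made up for illustration (they are not part of spambayes), and it assumes the hams are available as an iterable of raw message strings.

```python
from collections import Counter
from email import message_from_string


def tally_senders(raw_messages):
    """Count case-squashed Sender header values (hypothetical helper)."""
    counts = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # get_all returns every Sender header, or the default if none exist.
        for value in msg.get_all("Sender", []):
            # Collapse whitespace and squash case, like the shell pipeline.
            counts["sender: " + " ".join(value.split()).lower()] += 1
    return counts


def tail(counts, n=10):
    """Return the n most frequent entries in ascending order of count."""
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[-n:]


if __name__ == "__main__":
    # Tiny made-up corpus, purely for demonstration.
    sample = [
        "Sender: List-Admin@Example.org\nSubject: a\n\nbody\n",
        "Sender: list-admin@example.org\nSubject: b\n\nbody\n",
        "Subject: no sender header here\n\nbody\n",
    ]
    for key, count in tail(tally_senders(sample)):
        print("%7d %s" % (count, key))
```

Sorting on (count, key) keeps the ordering stable when two senders tie, which the shell version gets for free from the first sort.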

Just the presence of a Sender header, regardless of where it came from,
seems to be a pretty strong ham clue (something spammers could/do exploit?).
My roughly 7,000 spams only have 759 Sender headers.  I haven't experimented
with adding it to Options.options.address_headers, but your comment in
tokenizer.py suggests this probably wouldn't be too wise.
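
To put rough numbers on that, here is a back-of-the-envelope sketch.  It assumes each message carries at most one Sender header and rounds the corpus sizes to 12,000 hams and 7,000 spams, so the results are approximate.  The ratio it computes is a naive presence-rate comparison, not the probability spambayes would actually assign, and sender_presence_rates is a hypothetical name.

```python
def sender_presence_rates(ham_with, ham_total, spam_with, spam_total):
    """Return (ham_rate, spam_rate, naive_spam_share) for a header's presence.

    naive_spam_share is spam_rate / (spam_rate + ham_rate): a crude estimate
    of how spammy "has a Sender header" looks, ignoring any smoothing or
    priors a real classifier would apply.
    """
    ham_rate = ham_with / ham_total
    spam_rate = spam_with / spam_total
    naive_spam_share = spam_rate / (spam_rate + ham_rate)
    return ham_rate, spam_rate, naive_spam_share


if __name__ == "__main__":
    # Counts from this message; totals are rounded assumptions.
    ham_rate, spam_rate, share = sender_presence_rates(9609, 12000, 759, 7000)
    print("ham rate:  %.3f" % ham_rate)
    print("spam rate: %.3f" % spam_rate)
    print("naive spam share: %.3f" % share)
```

Under those assumptions about 80% of hams and about 11% of spams carry a Sender header.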

Skip


