[Spambayes] better Received header tokens
skip at pobox.com
Mon Mar 10 11:47:20 EST 2003
Tim> It was an example of harmful correlation, by way of illustrating
Tim> why a strong indicator isn't necessarily a desirable indicator.
Tim> This particular example applies pretty directly to any source from
Tim> which a user rarely (but not never) gets spam, and leaves clues
Tim> about itself.
True enough. I'm sure there are lots of such correlations. But if a
person's incoming mail isn't dominated by one source, such harmful
correlations will have less impact on the final score of any given message,
right? As an example, I just grep'd my ham collection for the Sender field,
squashed case, sorted and uniq'd, then sorted again. The tail end looked
like this:
     150 sender: folkmusic-admin at grassyhill.org
     221 sender: zope-admin at zope.org
     255 sender: folk music presenters <folkvenu at lists.psu.edu>
     450 sender: spambayes-bounces at python.org
     550 sender: python-checkins-admin at python.org
     555 sender: owner-6pack at autox.team.net
     688 sender: python-dev-admin at python.org
     821 sender: spamassassin-talk-admin at lists.sourceforge.net
    1387 sender: cedu-admin at manatee.mojam.com
    3091 sender: python-list-admin at python.org
This is out of 9609 Sender headers (in just under 12,000 hams). If I remember
comments you've made on this topic in the past, I expect your Sender:
headers to be more strongly dominated by Python-related messages than this.
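
For anyone who wants to repeat the tally without the shell pipeline, here is a
rough Python sketch of the same idea: pull the Sender header from each message,
squash case, count, and list the most frequent values last. The function name
tally_senders and the sample messages are made up for illustration; this is not
code from the Spambayes source.

```python
from collections import Counter
from email import message_from_string


def tally_senders(raw_messages):
    """Count case-folded Sender header values, least common first."""
    counts = Counter()
    for raw in raw_messages:
        sender = message_from_string(raw).get("Sender")
        if sender is not None:
            counts[sender.strip().lower()] += 1
    # Ascending by count, so the dominant senders land at the tail,
    # much like `sort | uniq -c | sort -n`.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


# Hypothetical sample messages, only to show the shape of the input.
sample = [
    "Sender: Python-List-Admin@python.org\nSubject: a\n\nbody\n",
    "Sender: python-list-admin@python.org\nSubject: b\n\nbody\n",
    "Sender: zope-admin@zope.org\nSubject: c\n\nbody\n",
    "Subject: no sender here\n\nbody\n",
]

for sender, count in tally_senders(sample):
    print(count, sender)
```

Messages without a Sender header are simply skipped, which is why the total
number of Sender headers can be smaller than the number of messages.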
Just the presence of a Sender header, regardless of where it came from,
seems to be a pretty strong ham clue (something spammers could/do exploit?).
My roughly 7,000 spams only have 759 Sender headers. I haven't experimented
with adding it to Options.options.address_headers, but your comment in
tokenizer.py suggests this probably wouldn't be too wise.
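
To put rough numbers on the "presence of a Sender header" clue, the sketch below
works out how often the header appears in each collection. The totals 12000 and
7000 are round stand-ins for "just under 12,000" hams and "roughly 7,000" spams,
so the results are approximate. The last function is only a simple
ratio-of-rates estimate, assumed here for illustration; it is not the
classifier's actual scoring, which applies further adjustments.

```python
def presence_rate(with_header, total):
    """Fraction of messages in a collection that carry the header."""
    return with_header / total


def raw_spam_probability(ham_rate, spam_rate):
    """Simple ratio-of-rates estimate for a clue (illustrative only)."""
    return spam_rate / (ham_rate + spam_rate)


# Header counts are from the message above; the totals are rounded
# assumptions, since exact collection sizes were not given.
ham_rate = presence_rate(9609, 12000)
spam_rate = presence_rate(759, 7000)

print(f"ham messages with Sender:  {ham_rate:.1%}")
print(f"spam messages with Sender: {spam_rate:.1%}")
print(f"raw spam probability of the clue: "
      f"{raw_spam_probability(ham_rate, spam_rate):.2f}")
```

A clue that shows up in most ham and only a small share of spam ends up with a
low spam probability, which is what makes it look like a strong ham indicator.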