[Spambayes] better Received header tokens

Sun Mar 9 22:13:40 EST 2003

[T. Alexander Popiel]
> Something that has occurred to me recently: how many tokens does it
> take to significantly change the scores?  Most of the recent tokenizing
> experiments have been adding between one and a handful of tokens, or
> even reducing token count.  Perhaps our problem is not that the
> identification methods we're coming up with are bad (heck, Tim did
> indicate that the bytes/word token _was_ a strong indicator... I
> didn't look at the values for the token itself), but rather that
> these new methods of identification are getting drowned out in the
> noise.

Oddly, I doubt it matters.  The median ham score is near 0, and the median
spam score near 100, so most messages are very solidly at one end.  When a
new token is added, it's not going to have any substantial effect on those,
it's going to affect Unsures, and msgs near the Unsure cutoffs.  One token
is enough to swing a msg near a boundary to the other side.
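
To make the point above concrete, here is a minimal sketch, assuming a
simplified form of chi-squared combining of per-token spam probabilities. The
token probabilities, the 0.90 cutoff, and the names chi2Q, score, solid_ham,
and borderline are all illustrative assumptions, not values taken from a real
corpus or from the project's code. It adds the same strong spam token to a
solidly hammy message and to a borderline one.

```python
import math

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2); v must be even."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def score(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    # Near 1 when the probabilities are mostly close to 1 (spammy tokens).
    spamminess = chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # Near 1 when the probabilities are mostly close to 0 (hammy tokens).
    hamminess = chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (spamminess - hamminess + 1.0) / 2.0

SPAM_CUTOFF = 0.90                 # assumed cutoff, for illustration only

solid_ham = [0.01] * 20            # hypothetical msg: twenty strong ham tokens
borderline = [0.9, 0.9, 0.9, 0.1]  # hypothetical msg just under the cutoff
new_token = 0.99                   # one newly added strong spam token

solid_before = score(solid_ham)
solid_after = score(solid_ham + [new_token])
border_before = score(borderline)                # about 0.83
border_after = score(borderline + [new_token])   # about 0.93

print("solid ham:  %.4f -> %.4f" % (solid_before, solid_after))
print("borderline: %.4f -> %.4f" % (border_before, border_after))
```

Under these assumptions the solid ham stays pinned near 0, while the
borderline message moves from Unsure to above the assumed spam cutoff on the
strength of a single token.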

Note that strong indicators aren't necessarily *good* indicators, either:
if they're strongly correlated with other strong indicators, a bad decision
is easy to get.  That's why we strip HTML decorations, for example.  For
another example, about the only spam I see rate Unsure anymore is stuff that
leaks thru SpamAssassin via python.org.  spambayes *usually* wouldn't have any
trouble with such spam on its own, but in that case there are a dozen header
clues all effectively saying "this came from python.org", and those are all
strong ham clues (thanks to SpamAssassin's usual effectiveness).  However,
they're really all the same clue, and the system has no way to realize that;
treating them as a dozen distinct clues gives them way more credence than
they deserve.
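
The effect of counting one clue a dozen times can be sketched as follows,
again assuming a simplified form of chi-squared combining. The body and header
probabilities, both cutoffs, and every name in the block are hypothetical,
chosen only to illustrate the argument. The same spammy body is scored with
the "came from python.org" ham clue counted once, and then counted twelve
times as if it were twelve independent clues.

```python
import math

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2); v must be even."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def score(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    spamminess = chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    hamminess = chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (spamminess - hamminess + 1.0) / 2.0

HAM_CUTOFF, SPAM_CUTOFF = 0.20, 0.90   # assumed cutoffs, illustration only

# Hypothetical spam probabilities for clues found in the message body.
body_clues = [0.99, 0.99, 0.95, 0.95, 0.9, 0.9]
# Hypothetical spam probability of one "came from python.org" header clue.
header_clue = 0.03

counted_once = score(body_clues + [header_clue])
# Twelve header tokens that all restate the same fact, each treated
# as an independent piece of evidence.
counted_dozen = score(body_clues + [header_clue] * 12)

print("header clue counted once:     %.3f" % counted_once)
print("header clue counted 12 times: %.3f" % counted_dozen)
```

With these made-up numbers the message scores as spam when the header evidence
is counted once, and drops into the Unsure range when the same evidence is
counted twelve times, even though no new information was added.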