[Spambayes] Thank you
Anthony Baxter
anthony at interlink.com.au
Thu Jul 31 15:45:21 EDT 2003
>>> "Richard Jowsey" wrote
> Interestingly, the *only* spam clue in my top-ten tokens was the word
> "skeptical", which came in with a raw probability of 0.98 (everything
> else was under 0.10). But that was just enough, apparently, to nudge
> it over into the dark side.
It showed up as solidly 'ham' for me, '*H*': 0.99; '*S*': 0.03,
but there were a bunch of solid spam clues:
'income': 0.89; 'magazines,': 0.90; 'list-id:Discussion': 0.91;
'now!': 0.92; 'you!': 0.92; 'subject:you': 0.94; 'income.': 0.94;
'offers,': 0.95; '"remove"': 0.95; 'money!': 0.97
The problem was that Mailman puts in a hell of a lot of headers and
suchlike, and each one turns into its own ham clue:
'everywhere,': 0.04; 'url:mailman': 0.06;
'errors-to:python.org': 0.06; 'list-archive:skip:m 10': 0.06;
'list-help:python.org': 0.06; 'list-post:python.org': 0.06;
'list-subscribe:python.org': 0.06;
'list-subscribe:skip:m 10': 0.06;
'list-unsubscribe:python.org': 0.06;
'list-unsubscribe:skip:m 10': 0.06;
'return-path:python.org': 0.06; 'sender:python.org': 0.06;
'url:python': 0.07; 'list-subscribe:mailman': 0.08;
'list-unsubscribe:mailman': 0.08; 'email addr:python.org': 0.08;
'list-archive:pipermail': 0.08; 'url:listinfo': 0.08;
'list-subscribe:http': 0.08; 'list-unsubscribe:http': 0.08;
'broke': 0.08; 'header:Errors-To:1': 0.09; 'to:python.org': 0.09;
'skip:_ 40': 0.09; 'spambayes': 0.09; 'subject:Spambayes': 0.09;
'list-id:list': 0.09; 'subject:] ': 0.10; 'list-id:for': 0.10;
'list-subscribe:mailto': 0.11; 'list-help:help': 0.11;
'list-help:mailto': 0.11; 'list-help:subject': 0.11;
'list-subscribe:subject': 0.11; 'list-subscribe:subscribe': 0.11;
'list-unsubscribe:subject': 0.11; 'list-help:request': 0.11;
'list-post:mailto': 0.11; 'list-subscribe:listinfo': 0.11;
'list-subscribe:request': 0.11;
'list-unsubscribe:listinfo': 0.11;
'list-unsubscribe:request': 0.11; 'list-archive:http': 0.11;
'that.': 0.12; 'list-unsubscribe:unsubscribe': 0.12;
'list-unsubscribe:mailto': 0.13; 'looked': 0.13; 'tie': 0.14;
This suggests we could probably be smarter about parsing headers from
Mailman, to reduce the number of highly correlated clues.
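
As a rough illustration of that idea (a sketch only, not what the
tokenizer actually does): the snippet below takes a handful of the clues
quoted above, combines them with a Fisher-style chi-squared calculation
that is assumed here to mirror the *H*/*S* scoring, and then repeats the
calculation after keeping only the strongest clue per List-* header. The
names chi2q, combine and collapse_list_headers are made up for this
example, and the "one clue per header" rule is just one possible way of
being smarter about it.

```python
import math

def chi2q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(clues):
    """Return (H, S, score) for a list of (token, prob) clues.

    Assumption: H and S are built from -2*sum(ln p) and -2*sum(ln(1-p))
    and the final score is (S - H + 1) / 2.
    """
    n = len(clues)
    if n == 0:
        return 0.0, 0.0, 0.5
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for _, p in clues), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for _, p in clues), 2 * n)
    return h, s, (s - h + 1.0) / 2.0

def collapse_list_headers(clues):
    """Hypothetical rule: keep one clue per List-* header, the one
    farthest from 0.5. All other clues pass through unchanged."""
    best = {}
    kept = []
    for token, prob in clues:
        name, sep, _ = token.partition(":")
        if sep and name.startswith("list-"):
            prev = best.get(name)
            if prev is None or abs(prob - 0.5) > abs(prev[1] - 0.5):
                best[name] = (token, prob)
        else:
            kept.append((token, prob))
    return kept + list(best.values())

# A subset of the clues quoted in this message.
clues = [
    ("money!", 0.97), ("income.", 0.94), ('"remove"', 0.95), ("offers,", 0.95),
    ("list-subscribe:python.org", 0.06), ("list-subscribe:skip:m 10", 0.06),
    ("list-subscribe:mailman", 0.08), ("list-subscribe:http", 0.08),
    ("list-unsubscribe:python.org", 0.06), ("list-unsubscribe:skip:m 10", 0.06),
    ("list-unsubscribe:mailman", 0.08), ("list-unsubscribe:http", 0.08),
    ("list-help:python.org", 0.06), ("list-post:python.org", 0.06),
]

collapsed = collapse_list_headers(clues)
before = combine(clues)
after = combine(collapsed)
print("clues: %d -> %d" % (len(clues), len(collapsed)))
print("before: H=%.2f S=%.2f score=%.2f" % before)
print("after:  H=%.2f S=%.2f score=%.2f" % after)
```

With the near-duplicate List-* tokens collapsed, the few spam clues are
no longer outvoted by a pile of tokens that all say the same thing ("this
came through python.org's Mailman"), so the score moves towards spam.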