Now THAT's FUNNY! ;-O
- Bill stuart@tessco.com
-----Original Message-----
From: yourvacation@starmail.com [mailto:yourvacation@starmail.com]
Sent: Wednesday, July 30, 2003 2:39 PM
To: SpamBayes@python.org
Subject: [Spambayes] Thank you
Thank you for taking your time to read this email... it could be benefical for you!
[spam deleted...]
I was laughing even harder when I found that that "Thank you" spam had been dropped in the trash (p=0.53), since it's taken 3 good->spam re-classifications to convince my system that there could *ever* be junk mail coming from this list. In fact, those 3 were the only spams that have slipped through my filter in the past month!
Interestingly, the *only* spam clue in my top-ten tokens was the word "skeptical", which came in with a raw probability of 0.98 (everything else was under 0.10). But that was just enough, apparently, to nudge it over into the dark side. Lurking spammers: take note, don't use that word... ;-)
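To make the "one clue nudges the score" effect concrete, here is a minimal sketch, assuming chi-squared (Fisher-style) combining of per-token probabilities into separate ham and spam scores. The function names `chi2q` and `combine` are hypothetical, and the clue values are made up to mirror the description above (nine hammy tokens at 0.10 plus one spammy token at 0.98); they are not the actual tokens from that message and are not expected to reproduce p=0.53.

```python
import math

def chi2q(x2, v):
    """Probability that a chi-squared variable with v (even) degrees
    of freedom is at least x2."""
    assert v % 2 == 0, "this closed form only works for even v"
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs):
    """Return (ham_score, spam_score, combined) for a list of per-token
    spam probabilities, each strictly between 0 and 1."""
    n = len(probs)
    ln_ham = sum(math.log(p) for p in probs)         # very negative when tokens are hammy
    ln_spam = sum(math.log(1.0 - p) for p in probs)  # very negative when tokens are spammy
    ham_score = 1.0 - chi2q(-2.0 * ln_ham, 2 * n)
    spam_score = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)
    combined = (spam_score - ham_score + 1.0) / 2.0
    return ham_score, spam_score, combined

# Illustrative clue sets only.
ham_only = [0.10] * 9
with_spam_clue = ham_only + [0.98]

print("without strong spam clue:", combine(ham_only))
print("with one 0.98 clue:      ", combine(with_spam_clue))
```

In this toy setup the single 0.98 clue raises both the spam score and the combined score. How far it moves a real message depends on how many ham clues are present and how strong they are.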
"Richard Jowsey" wrote:
> Interestingly, the *only* spam clue in my top-ten tokens was the word "skeptical", which came in with a raw probability of 0.98 (everything else was under 0.10). But that was just enough, apparently, to nudge it over into the dark side.
It showed up as solidly 'ham' for me, '*H*': 0.99; '*S*': 0.03, but there were a bunch of solid spam clues:
'income': 0.89; 'magazines,': 0.90; 'list-id:Discussion': 0.91; 'now!': 0.92; 'you!': 0.92;
'subject:you': 0.94; 'income.': 0.94; 'offers,': 0.95; '"remove"': 0.95; 'money!': 0.97
The problem was that mailman puts in a hell of a lot of headers and suchlike:
'everywhere,': 0.04; 'url:mailman': 0.06; 'errors-to:python.org': 0.06;
'list-archive:skip:m 10': 0.06; 'list-help:python.org': 0.06; 'list-post:python.org': 0.06;
'list-subscribe:python.org': 0.06; 'list-subscribe:skip:m 10': 0.06;
'list-unsubscribe:python.org': 0.06; 'list-unsubscribe:skip:m 10': 0.06;
'return-path:python.org': 0.06; 'sender:python.org': 0.06; 'url:python': 0.07;
'list-subscribe:mailman': 0.08; 'list-unsubscribe:mailman': 0.08; 'email addr:python.org': 0.08;
'list-archive:pipermail': 0.08; 'url:listinfo': 0.08; 'list-subscribe:http': 0.08;
'list-unsubscribe:http': 0.08; 'broke': 0.08; 'header:Errors-To:1': 0.09; 'to:python.org': 0.09;
'skip:_ 40': 0.09; 'spambayes': 0.09; 'subject:Spambayes': 0.09; 'list-id:list': 0.09;
'subject:] ': 0.10; 'list-id:for': 0.10; 'list-subscribe:mailto': 0.11; 'list-help:help': 0.11;
'list-help:mailto': 0.11; 'list-help:subject': 0.11; 'list-subscribe:subject': 0.11;
'list-subscribe:subscribe': 0.11; 'list-unsubscribe:subject': 0.11; 'list-help:request': 0.11;
'list-post:mailto': 0.11; 'list-subscribe:listinfo': 0.11; 'list-subscribe:request': 0.11;
'list-unsubscribe:listinfo': 0.11; 'list-unsubscribe:request': 0.11; 'list-archive:http': 0.11;
'that.': 0.12; 'list-unsubscribe:unsubscribe': 0.12; 'list-unsubscribe:mailto': 0.13;
'looked': 0.13; 'tie': 0.14;
This suggests we could probably be smarter about parsing headers from mailman to reduce the number of highly correlated clues.
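One quick way to see how many of those ham clues come from the same few Mailman headers is to group the clue tokens by their prefix (the part before the first colon). The following is a rough sketch: the helper `group_clues` and the regular expression are hypothetical, and `SAMPLE` is a hand-picked subset of the clues quoted above rather than the full list.

```python
import re
from collections import Counter

# Matches clue entries of the form 'token': 0.06
CLUE_RE = re.compile(r"'([^']*)': (\d\.\d+)")

def group_clues(clue_text):
    """Count clues per prefix; tokens without a colon are counted as '(body)'."""
    counts = Counter()
    for token, _prob in CLUE_RE.findall(clue_text):
        prefix = token.split(":", 1)[0] if ":" in token else "(body)"
        counts[prefix] += 1
    return counts

# A subset of the ham clues from the message above.
SAMPLE = (
    "'url:mailman': 0.06; 'errors-to:python.org': 0.06; "
    "'list-help:python.org': 0.06; 'list-post:python.org': 0.06; "
    "'list-subscribe:python.org': 0.06; 'list-unsubscribe:python.org': 0.06; "
    "'list-subscribe:mailman': 0.08; 'list-unsubscribe:mailman': 0.08; "
    "'broke': 0.08; 'spambayes': 0.09"
)

counts = group_clues(SAMPLE)
list_total = sum(n for prefix, n in counts.items() if prefix.startswith("list-"))

for prefix, n in counts.most_common():
    print(prefix, n)
print("clues from list-* headers:", list_total, "of", sum(counts.values()))
```

Even in this small subset, more than half of the clues come from `list-*` headers that always appear together, which is the correlation being described.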
[Anthony Baxter]
...
This suggests we could probably be smarter about parsing headers from mailman to reduce the number of highly correlated clues.
Well, as designed, and as it does by default, and as the Outlook client still does, spambayes ignores all list-XYZ header lines. You must have enabled some dubious <wink> option if you're getting all that crud in your database.
Even without that stuff, the tokenizer picks up plenty of "spambayes list" clues:
"'spambayes':"
'cc:addr:python.org'
'cc:addr:spambayes'
'email addr:python.org'
'email name:spambayes'
'mailman'
'sender:addr:python.org'
'sender:addr:spambayes-bounces'
'sender:no real name:2**0'
'spambayes'
'subject:Spambayes'
'url:listinfo'
'url:mail'
'url:mailman'
'url:python'
'url:spambayes'
With the exception of 'sender:no real name:2**0', those are probably strong ham tokens for most of us here.
Sometimes cross-clue correlation hurts, as it does here. It's visible then. What's much harder to see is that correlation seems more often to help. Too bad we stopped doing research here <wink>.
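The idea of ignoring list-XYZ header lines can be illustrated with the standard library's `email` package. This is only a sketch of the idea and not the SpamBayes tokenizer: the function `non_list_headers` is hypothetical, and the sample message uses made-up addresses and header values.

```python
from email import message_from_string

def non_list_headers(raw_message):
    """Return (name, value) pairs for every header whose name does not
    start with 'List-' (compared case-insensitively)."""
    msg = message_from_string(raw_message)
    return [
        (name, value)
        for name, value in msg.items()
        if not name.lower().startswith("list-")
    ]

# Made-up message; header values are placeholders for illustration.
RAW = (
    "From: someone@example.com\n"
    "To: SpamBayes@python.org\n"
    "Subject: [Spambayes] Thank you\n"
    "List-Id: Example discussion list <list.example.com>\n"
    "List-Unsubscribe: <mailto:list-request@example.com?subject=unsubscribe>\n"
    "List-Help: <mailto:list-request@example.com?subject=help>\n"
    "Sender: list-bounces@example.com\n"
    "\n"
    "Thank you for taking your time to read this email...\n"
)

kept = non_list_headers(RAW)
for name, value in kept:
    print(f"{name}: {value}")
```

Only the remaining headers would then be handed to whatever generates header tokens. As noted above, headers such as Sender and Subject still carry plenty of list-specific clues on their own.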
participants (4)
- Anthony Baxter
- Richard Jowsey
- Stuart, Bill
- Tim Peters