[Spambayes] Ping: subject header ignored? [was: Not mining my Subject headers?]
Seth Goodman
sethg at goodmanassociates.com
Tue Feb 6 19:03:09 CET 2007
David Abrahams wrote on Tuesday, February 06, 2007 11:05 AM -0600:
> David Abrahams <dave at boost-consulting.com> writes:
>
> > How is it that for a message with
> >
> > Subject: Huge online pharmacy
> >
> > Spambayes isn't using "pharmacy" as a classification token? I can't
> > find a setting that will make it do that, either.
>
> Am I just misinterpreting what I'm seeing, or does SB really ignore
> the Subject header?
The subject header produces tokens that start with the string
"subject:". When looking at the list of clues Spambayes finds, you
first see the list of "significant tokens", which means up to 150 (?)
tokens that score either below 0.4 or above 0.6. The complete list is
shown as "all message tokens". If a token appears in the "all message
tokens" list but not in the "significant tokens" list, it's probably
because the token scored between 0.4 and 0.6, which means the
statistics do not indicate either ham or spam.
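To make the "subject:" prefix concrete, here is a minimal sketch of how
a Subject line might be turned into tokens. It is a simplified
approximation, not the actual Spambayes tokenizer: the two regular
expressions, the minimum word length of 3, and the function name
subject_tokens are all assumptions made for illustration. It does
mirror what the token list further down shows: words keep their case,
and runs of punctuation or whitespace become tokens of their own.

```python
import re

# Assumed patterns (illustrative only, not copied from Spambayes).
WORD_RE = re.compile(r"[\w$.%]+")     # word-like pieces of the subject
PUNCT_RUN_RE = re.compile(r"\W+")     # runs of punctuation/whitespace


def subject_tokens(subject, min_len=3):
    """Yield 'subject:'-prefixed tokens for one Subject header value."""
    for word in WORD_RE.findall(subject):
        if len(word) >= min_len:      # assumption: very short words are skipped
            yield "subject:" + word
    for run in PUNCT_RUN_RE.findall(subject):
        yield "subject:" + run


tokens = list(subject_tokens("Huge online pharmacy"))
print(tokens)
```

Under these assumptions a subject word shows up as 'subject:pharmacy',
never as a bare 'pharmacy'; the bare form only comes from the message
body.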
Here's what Spambayes shows for your message that I am responding to
(message itself removed for brevity). Notice there are 101 tokens, of
which 32 are significant for my database. You can see the gap in
significant clues between 0.4 and 0.6, which is apparently 69 tokens.
For this version of Spambayes (1.1a3), you should get the same number of
total tokens. How many significant tokens you get and which ones are
significant depends on the messages you trained, so those are generally
different between users.
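The split between "significant" and "all" tokens can be sketched as a
simple filter. This is an illustration under stated assumptions, not
the Spambayes code: the function name select_significant is made up,
the 0.4/0.6 cutoffs and the limit of 150 are taken from the description
above (the limit is the uncertain one), and the 0.5 score for 'online'
is a hypothetical value for a token with no useful statistics. The
other three scores come from the report below.

```python
def select_significant(probs, low=0.4, high=0.6, max_tokens=150):
    """Return (token, spamprob) pairs outside the neutral band.

    Keeps at most max_tokens entries, preferring those farthest from
    0.5, then lists them in ascending spamprob order like the report.
    """
    strong = [(tok, p) for tok, p in probs.items() if p < low or p > high]
    strong.sort(key=lambda item: abs(item[1] - 0.5), reverse=True)
    kept = strong[:max_tokens]
    return sorted(kept, key=lambda item: item[1])


probs = {
    "pharmacy": 0.377219,   # from the report
    "make": 0.394,          # from the report
    "huge": 0.767471,       # from the report
    "online": 0.5,          # hypothetical: no evidence either way
}
significant = select_significant(probs)
print(significant)
```

With these inputs 'online' is counted among all message tokens but
drops out of the significant list, which is the same effect that
accounts for the 69 missing tokens.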
In particular, there are tokens for both of the following:
pharmacy
"pharmacy"
but only the one without the quotes was significant given the messages I
trained. It turns out to be a very weak ham clue for me at this point
in time: only one ham and one spam trained that included this token.
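To see why one ham and one spam give 0.377 rather than 0.5, here is a
sketch of the per-token calculation. The function name token_spamprob
is made up, and the two constants are assumptions: s=0.45 and x=0.5 are
what I believe are the default "unknown word" strength and probability.
They do reproduce the figures in the report below for my training
counts (258 ham, 480 spam).

```python
def token_spamprob(ham_count, spam_count, n_ham, n_spam, s=0.45, x=0.5):
    """Estimate a token's spam probability from training counts.

    s and x are assumed defaults: the weight given to the prior and
    the prior probability for a token never seen before.
    """
    n = ham_count + spam_count
    if n == 0:
        return x                      # no evidence: fall back to the prior
    ham_ratio = ham_count / n_ham     # fraction of ham containing the token
    spam_ratio = spam_count / n_spam  # fraction of spam containing the token
    raw = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the raw estimate toward x; the fewer sightings, the stronger the pull.
    return (s * x + n * raw) / (s + n)


pharmacy = token_spamprob(1, 1, n_ham=258, n_spam=480)
print(round(pharmacy, 6))
```

Because I have trained fewer ham than spam, a single ham sighting
counts for more than a single spam sighting, so the token leans
slightly toward ham. The same function gives about 0.155172 for a token
seen in one ham and no spam, and about 0.844828 for one spam and no
ham, matching the many 1/0 and 0/1 rows in the table.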
---------------------------------------------------------------------
Combined Score: 1% (0.00962107)
Internal ham score (*H*): 0.981202
Internal spam score (*S*): 0.000443716
# ham trained on: 258
# spam trained on: 480
The last time this message was classified or trained:
This message had not been filtered.
This message had not been trained.
32 Significant Tokens
token                       spamprob     #ham  #spam
'subject:] '                0.0348837    6     0
'spambayes'                 0.0918367    2     0
'header:Errors-To:1'        0.134887     8     2
'url:org'                   0.141459     63    19
'subject:'                  0.142075     14    4
'either.'                   0.155172     1     0
'email name:spambayes'      0.155172     1     0
'header:Received:10'        0.155172     1     0
'subject:Not'               0.155172     1     0
'subject:Spambayes'         0.155172     1     0
'url:faq'                   0.155172     1     0
'url:listinfo'              0.155172     1     0
'url:mailman'               0.155172     1     0
'writes:'                   0.155172     1     0
'david'                     0.200697     7     3
'subject:: '                0.24406      52    31
'url:html'                  0.262771     32    21
'sender:no real name:2**0'  0.273975     6     4
'header:Mime-Version:1'     0.276019     44    31
'message'                   0.288541     44    33
'using'                     0.305479     32    26
"i'm"                       0.350935     25    25
"isn't"                     0.353262     9     9
'setting'                   0.360087     3     3
"can't"                     0.372129     11    12
'pharmacy'                  0.377219     1     1
'make'                      0.394        53    64
'header:User-Agent:1'       0.636176     15    49
'huge'                      0.767471     3     19
'subject:was'               0.844828     0     1
'skip:_ 40'                 0.908163     0     2
'faq'                       0.934783     0     3
All Message Tokens
101 unique tokens
'"pharmacy"'
'abrahams'
'asking:'
'before'
'boost'
"can't"
'cc:none'
'check'
'consulting'
'content-type:text/plain'
'dave'
'david'
'does'
'either.'
'email addr:python.org'
'email name:spambayes'
'faq'
'find'
'for'
'from:addr:boost-consulting.com'
'from:addr:dave'
'from:name:david abrahams'
'header:Date:1'
'header:Errors-To:1'
'header:From:1'
'header:Message-ID:1'
'header:Mime-Version:1'
'header:Received:10'
'header:Return-Path:1'
'header:Subject:1'
'header:To:1'
'header:User-Agent:1'
'header:X-Complaints-To:1'
'header?'
'how'
'huge'
"i'm"
'ignore'
"isn't"
'just'
'make'
'message'
'message-id:@valverde.peloton'
'online'
'pharmacy'
'proto:http'
'really'
'reply-to:none'
'seeing,'
'sender:addr:python.org'
'sender:addr:spambayes-bounces'
'sender:no real name:2**0'
'setting'
'skip:_ 40'
'skip:c 10'
'skip:m 10'
'skip:w 20'
'spambayes'
'subject'
'subject:'
'subject: '
'subject: \n\t'
'subject:: '
'subject:? ['
'subject:?]'
'subject:Not'
'subject:Ping'
'subject:Spambayes'
'subject:Subject'
'subject:['
'subject:] '
'subject:header'
'subject:headers'
'subject:ignored'
'subject:mining'
'subject:subject'
'subject:was'
'that'
'that,'
'the'
'to:2**0'
'to:addr:python.org'
'to:addr:spambayes'
'to:no real name:2**0'
'token?'
'url:faq'
'url:html'
'url:listinfo'
'url:mail'
'url:mailman'
'url:net'
'url:org'
'url:python'
'url:sf'
'url:spambayes'
'using'
'what'
'will'
'with'
'writes:'
'x-mailer:none'