[Spambayes] Ping: subject header ignored? [was: Not mining mySubject headers?]

Seth Goodman sethg at goodmanassociates.com
Tue Feb 6 19:03:09 CET 2007


David Abrahams wrote on Tuesday, February 06, 2007 11:05 AM -0600:

> David Abrahams <dave at boost-consulting.com> writes:
>
> > How is it that for a message with
> >
> >   Subject: Huge online pharmacy
> >
> > Spambayes isn't using "pharmacy" as a classification token?  I can't
> > find a setting that will make it do that, either.
>
> Am I just misinterpreting what I'm seeing, or does SB really ignore
> the Subject header?

The subject header produces tokens that start with the string
"subject:".  When looking at the list of clues Spambayes finds, you
first see the list of "significant tokens", which means up to 150 (?)
tokens that score below 0.4 or above 0.6.  The complete list is shown
as "all message tokens".  If a token appears in the "all message tokens"
list but not in the "significant tokens" list, it's probably because the
token scored between 0.4 and 0.6, which means the statistics do not
indicate ham or spam.
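
A minimal sketch of that selection rule, not the actual Spambayes
code.  The function name and parameters are hypothetical, the 150 cap
is the uncertain figure mentioned above, and the 0.5 score for the
quoted token is assumed for illustration since the report does not
show it:

```python
def significant_tokens(token_probs, min_strength=0.1, max_tokens=150):
    """Return (token, spamprob) pairs that count as significant clues.

    A token is kept when its spamprob is at least min_strength away
    from the neutral value 0.5, i.e. at or below 0.4 or at or above 0.6.
    If more than max_tokens qualify, the strongest ones win.
    """
    strong = [(tok, p) for tok, p in token_probs.items()
              if abs(p - 0.5) >= min_strength]
    # Strongest clues first, so the cap drops the weakest ones.
    strong.sort(key=lambda pair: abs(pair[1] - 0.5), reverse=True)
    kept = strong[:max_tokens]
    # Display order: lowest spamprob (most hammy) first.
    return sorted(kept, key=lambda pair: pair[1])


# Scores copied from the report, except '"pharmacy"', which is assumed.
sample = {
    'pharmacy': 0.377219,
    '"pharmacy"': 0.5,       # assumed neutral score, not from the report
    'huge': 0.767471,
    'make': 0.394,
}
clues = significant_tokens(sample)
```

With these inputs, 'pharmacy', 'make' and 'huge' are kept and the
quoted '"pharmacy"' is dropped, because it sits inside the 0.4 to 0.6
band.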

Here's what Spambayes shows for your message that I am responding to
(message itself removed for brevity).  Notice there are 101 tokens, of
which 32 are significant for my database.  You can see the gap in
significant clues between 0.4 and 0.6; apparently the other 69 tokens
fall in that range.  For this version of Spambayes (1.1a3), you should
get the same number of total tokens.  How many significant tokens you
get and which ones are significant depend on the messages you trained,
so those generally differ between users.

In particular, there are tokens for both of the following:

pharmacy
"pharmacy"

but only the one without the quotes was significant given the messages I
trained.  It turns out to be a very weak ham clue for me at this point
in time (spamprob 0.377):  I have trained only one ham and one spam that
included this token.
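
To see how one ham and one spam lead to a spamprob just under 0.4,
here is a rough sketch of the per-token calculation.  It assumes
Robinson's adjustment with an unknown-word strength of 0.45 and an
unknown-word probability of 0.5; I believe those are the defaults, but
treat them as assumptions.  The function name is made up for
illustration.  The counts are the ones from the report in this
message (258 ham and 480 spam trained):

```python
def spamprob(ham_count, spam_count, n_ham, n_spam, s=0.45, x=0.5):
    """Estimate a token's spam probability from training counts.

    s and x are assumed values: the weight given to the prior and the
    prior probability for a token with no training data.
    Assumes n_ham and n_spam are both nonzero.
    """
    n = ham_count + spam_count
    if n == 0:
        return x  # never seen in training: neutral
    # Normalize by corpus size so 258 ham vs. 480 spam does not bias the result.
    ham_ratio = ham_count / n_ham
    spam_ratio = spam_count / n_spam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the raw estimate toward x; the pull weakens as n grows.
    return (s * x + n * raw) / (s + n)


N_HAM, N_SPAM = 258, 480
for token, ham, spam in [('pharmacy', 1, 1), ('either.', 1, 0), ('faq', 0, 3)]:
    print(token, round(spamprob(ham, spam, N_HAM, N_SPAM), 6))
```

Under those assumptions the results agree with the spamprob column in
the report: about 0.377219 for 'pharmacy', 0.155172 for 'either.' and
0.934783 for 'faq'.  The 'pharmacy' token lands below 0.5 even with
equal counts because one spam out of 480 is a smaller fraction than one
ham out of 258.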

---------------------------------------------------------------------

Combined Score: 1% (0.00962107)
Internal ham score (*H*): 0.981202
Internal spam score (*S*): 0.000443716

# ham trained on: 258
# spam trained on: 480

The last time this message was classified or trained:
This message had not been filtered.
This message had not been trained.
32 Significant Tokens
token                               spamprob         #ham  #spam
'subject:] '                        0.0348837           6      0
'spambayes'                         0.0918367           2      0
'header:Errors-To:1'                0.134887            8      2
'url:org'                           0.141459           63     19
'subject:'                          0.142075           14      4
'either.'                           0.155172            1      0
'email name:spambayes'              0.155172            1      0
'header:Received:10'                0.155172            1      0
'subject:Not'                       0.155172            1      0
'subject:Spambayes'                 0.155172            1      0
'url:faq'                           0.155172            1      0
'url:listinfo'                      0.155172            1      0
'url:mailman'                       0.155172            1      0
'writes:'                           0.155172            1      0
'david'                             0.200697            7      3
'subject:: '                        0.24406            52     31
'url:html'                          0.262771           32     21
'sender:no real name:2**0'          0.273975            6      4
'header:Mime-Version:1'             0.276019           44     31
'message'                           0.288541           44     33
'using'                             0.305479           32     26
"i'm"                               0.350935           25     25
"isn't"                             0.353262            9      9
'setting'                           0.360087            3      3
"can't"                             0.372129           11     12
'pharmacy'                          0.377219            1      1
'make'                              0.394              53     64
'header:User-Agent:1'               0.636176           15     49
'huge'                              0.767471            3     19
'subject:was'                       0.844828            0      1
'skip:_ 40'                         0.908163            0      2
'faq'                               0.934783            0      3


All Message Tokens
101 unique tokens

'"pharmacy"'
'abrahams'
'asking:'
'before'
'boost'
"can't"
'cc:none'
'check'
'consulting'
'content-type:text/plain'
'dave'
'david'
'does'
'either.'
'email addr:python.org'
'email name:spambayes'
'faq'
'find'
'for'
'from:addr:boost-consulting.com'
'from:addr:dave'
'from:name:david abrahams'
'header:Date:1'
'header:Errors-To:1'
'header:From:1'
'header:Message-ID:1'
'header:Mime-Version:1'
'header:Received:10'
'header:Return-Path:1'
'header:Subject:1'
'header:To:1'
'header:User-Agent:1'
'header:X-Complaints-To:1'
'header?'
'how'
'huge'
"i'm"
'ignore'
"isn't"
'just'
'make'
'message'
'message-id:@valverde.peloton'
'online'
'pharmacy'
'proto:http'
'really'
'reply-to:none'
'seeing,'
'sender:addr:python.org'
'sender:addr:spambayes-bounces'
'sender:no real name:2**0'
'setting'
'skip:_ 40'
'skip:c 10'
'skip:m 10'
'skip:w 20'
'spambayes'
'subject'
'subject:'
'subject: '
'subject: \n\t'
'subject:: '
'subject:? ['
'subject:?]'
'subject:Not'
'subject:Ping'
'subject:Spambayes'
'subject:Subject'
'subject:['
'subject:] '
'subject:header'
'subject:headers'
'subject:ignored'
'subject:mining'
'subject:subject'
'subject:was'
'that'
'that,'
'the'
'to:2**0'
'to:addr:python.org'
'to:addr:spambayes'
'to:no real name:2**0'
'token?'
'url:faq'
'url:html'
'url:listinfo'
'url:mail'
'url:mailman'
'url:net'
'url:org'
'url:python'
'url:sf'
'url:spambayes'
'using'
'what'
'will'
'with'
'writes:'
'x-mailer:none'


