[spambayes-dev] Cool Outlook mystery

Thu Aug 7 20:20:44 EDT 2003

Our bug 782709 is pretty interesting!  Tony just added a good clue to it.
I'll partly confirm it here, and add another bit of evidence.

After retraining and rescoring from scratch, there's a particular msg in my
Ham folder showing a spam score of 3% in my Spam column.  "show spam clues"
rates it much higher, at about 18%:

Spam Score: 0.180576

word                                spamprob         #ham  #spam
'*H*'                               0.722595            -      -
'*S*'                               0.083747            -      -
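
As a sanity check on the clue report itself, here is a minimal sketch that
recombines the two pseudo-clues into the final score.  It assumes the
chi-squared combining rule is (S - H + 1) / 2; the function name is made up
for illustration, and only the two numbers come from the report above.

```python
# Values copied from the "show spam clues" report above.
H = 0.722595  # the '*H*' row
S = 0.083747  # the '*S*' row


def combined_score(s, h):
    """Assumed combining rule: map S - H from [-1, 1] onto [0, 1]."""
    return (s - h + 1.0) / 2.0


score = combined_score(S, H)
print(round(score, 6))
```

Under that assumption the result agrees with the reported Spam Score of
0.180576 to six places, so the clue report is at least internally
consistent, and the 3% in the Spam column is the odd one out.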

Some of the token scores are amazing:

'to:no real name:2**0'              0.342745            7      7
'header:To:1'                       0.398161            7      9
'to:2**0'                           0.398161            7      9
'header:Date:1'                     0.64742             1      4
'header:Message-Id:1'               0.764668            0      1
'subject:.'                         0.764668            0      1
'subject: '                         0.846122            0      2
'header:From:1'                     0.871695            1     16

Notice I said this was a ham message, and I trained on it as ham.  Therefore
it shouldn't be possible that I see *any* token (let alone 3) in this
message with a ham-count of 0.  I've certainly got, e.g., way more than
1+4=5 training messages with a Date header too, and way more than 16 with a
"To" header, etc.

In my professional opinion, something is royally hosed <wink>.  My
observations so far match Tony's that it's confined to tokens in headers, so
it's probably not a database bug.
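
The ham-count argument above can be checked mechanically.  This is only a
sketch: the rows are typed in by hand from the clue list in this message,
and the function name is hypothetical, not anything in the SpamBayes code.

```python
# (token, spamprob, #ham, #spam), copied from the clue list above.
clues = [
    ("to:no real name:2**0", 0.342745, 7, 7),
    ("header:To:1",          0.398161, 7, 9),
    ("to:2**0",              0.398161, 7, 9),
    ("header:Date:1",        0.64742,  1, 4),
    ("header:Message-Id:1",  0.764668, 0, 1),
    ("subject:.",            0.764668, 0, 1),
    ("subject: ",            0.846122, 0, 2),
    ("header:From:1",        0.871695, 1, 16),
]


def impossible_clues(rows, trained_as_ham=True):
    """Return tokens whose count for the trained class is 0.

    A message trained as ham should bump the ham count of every
    token it contains, so a ham count of 0 cannot happen.
    """
    column = 2 if trained_as_ham else 3
    return [row[0] for row in rows if row[column] == 0]


suspects = impossible_clues(clues)
print(suspects)
```

For the rows listed here it flags the three zero-ham tokens, and all three
come from headers, which fits the pattern Tony and I are both seeing.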