[spambayes-dev] Nigerian mystery

Meyer, Tony T.A.Meyer at massey.ac.nz
Sat Sep 6 15:25:13 EDT 2003


> Okay, I now have everything cleaned out, and brand new fresh
> 1.0a5 files installed properly and running. I just got my 
> second Nigerian scam of the day, still rated solidly as Ham.

I pasted this into a message and sent it to myself and it scored 55%,
which isn't bad considering that I would have polluted it with lots of
clues about it being from me.

> I started looking over the thing and figured that Nigeria,
> petroleum, and million had to all be pretty spammy terms for 
> me, so I looked at the clues for this message. Unless I'm 
> losing what little mind I have left, those three terms are 
> not listed in the clues. (And Million appears five times, 
> petroleum and Nigeria twice each.)

The web interface 'show clues' only shows those clues that were used in
determining the classification of the message, not all the tokens in the
message (there is a request to change this, although I'm not sure how to
squeeze it into a single page).  In particular, there's a limit to how
many tokens are used ("Classifier":"max_discriminators"), and a range of
probabilities that aren't used ("Classifier":"minimum_prob_strength").

> I just used the nifty new feature for looking up those words,
> and to my great surprise, those three terms really aren't all 
> that spammy (or hammy, either) for me.

If those words are in the 0.4 to 0.6 range, then they're not used, which
would explain why they weren't in the clue list.
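
As a rough illustration of how those two options interact, here is a
minimal sketch (not the actual classifier code).  The function name
select_clues and the token probabilities other than "urgent" are made up
for the example; the default values of 150 and 0.1 are what I recall the
defaults being, and 0.1 matches the 0.4 to 0.6 range above.

```python
def select_clues(token_probs, max_discriminators=150, minimum_prob_strength=0.1):
    """Return the (token, prob) pairs that would be used for scoring.

    Hypothetical helper: tokens whose probability is closer than
    minimum_prob_strength to 0.5 are dropped, and of the rest only the
    max_discriminators strongest are kept.
    """
    candidates = []
    for token, prob in token_probs.items():
        distance = abs(prob - 0.5)  # how far the token is from neutral
        if distance >= minimum_prob_strength:
            candidates.append((distance, token, prob))
    # Strongest clues first, then cut the list off at the limit.
    candidates.sort(reverse=True)
    return [(token, prob) for _, token, prob in candidates[:max_discriminators]]


# Illustrative probabilities only; just "urgent" comes from a real database.
probs = {
    "urgent": 0.990798,
    "nigeria": 0.55,
    "petroleum": 0.48,
    "million": 0.58,
    "python": 0.02,
}
clues = select_clues(probs)
```

With numbers like these, "nigeria", "petroleum" and "million" all fall
inside the unused range, so they never reach the clue list even though
they appear in the message.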

I think the theory goes that you should get *more* of these words in
spam than in ham, so they should still be slightly spammy.  Or that if
you do see them equally, then there should be other words in the message
that give it a spam classification.  For example, "urgent" has a prob of
0.990798 for me.
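
To show how a few strong words can carry the classification on their
own, here is a sketch of chi-squared combining, which is how I
understand the classifier merges the clue probabilities.  Treat it as an
approximation under that assumption rather than the real code: the names
chi2q and chi_combine are mine, the probabilities other than 0.990798
are invented, and it assumes every probability is strictly between 0
and 1.

```python
import math


def chi2q(x2, v):
    """Probability that a chi-squared value with v (even) degrees of
    freedom is at least x2."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5  # no clues at all: completely unsure
    # Sums of logs avoid underflow from multiplying many small numbers.
    ln_ham_side = sum(math.log(1.0 - p) for p in probs)
    ln_spam_side = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_ham_side, 2 * n)   # evidence for spam
    h = 1.0 - chi2q(-2.0 * ln_spam_side, 2 * n)  # evidence for ham
    return (s - h + 1.0) / 2.0


# Neutral words have already been dropped; only strong clues remain.
score = chi_combine([0.990798, 0.97, 0.95])
```

Because the near-neutral words are not part of the input at all, the
score depends only on clues like "urgent".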

=Tony Meyer


