[Spambayes] Images of commercial text with decoy text are mushing my index

skip at pobox.com skip at pobox.com
Mon Jan 1 16:00:15 CET 2007

    Jamie> since the decoy text is completely non-commercial in nature, it
    Jamie> seems to be polluting my index and making detection less
    Jamie> accurate.  With OCR, will this continue to be an issue?

Sure, if the decoy text actually turns out to be relevant from a scoring
standpoint.  By default the SpamBayes classifier only considers tokens
(words) which score <= 0.4 or >= 0.6.  My guess is that most of the words in
the decoy text are clustered around 0.5, so they aren't even considered.
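
To make the cutoff concrete, here is a minimal sketch of that filtering
step.  It is not the actual SpamBayes code: the function name, the
parameter names and the token scores are all made up for illustration.
Only the 0.4 and 0.6 bounds come from the description above.

```python
# Illustrative sketch only; significant_tokens() is a hypothetical helper,
# not part of the SpamBayes API, and the scores below are invented.

def significant_tokens(token_scores, low=0.4, high=0.6):
    """Return only the tokens whose score is <= low or >= high.

    Tokens scoring strictly between the two bounds are treated as
    uninformative and dropped before the message is scored.
    """
    return {
        token: score
        for token, score in token_scores.items()
        if score <= low or score >= high
    }


# Invented per-token spam probabilities for a message that mixes
# commercial words with neutral decoy text.
scores = {
    "viagra": 0.98,   # strongly spammy, kept
    "meeting": 0.05,  # strongly hammy, kept
    "weather": 0.52,  # decoy word near 0.5, dropped
    "garden": 0.47,   # decoy word near 0.5, dropped
}

kept = significant_tokens(scores)
print(sorted(kept))
```

With these invented numbers the two decoy words fall inside the 0.4 to 0.6
band and never reach the scoring step, which is why neutral decoy text
tends to have little effect.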


