[Spambayes] Images of commercial text with decoy text are mushing my index

skip at pobox.com skip at pobox.com
Mon Jan 1 16:46:25 CET 2007


    Jamie> With OCR, will this continue to be an issue?

Forgot to answer this question.  The decoy text will still be considered
using the same parameters.  By default, the classifier only considers the
150 most extreme tokens (the highest- and lowest-scoring ones), so if the
message is near that limit, adding high- or low-scoring OCR-generated tokens
will push some other tokens out of consideration.  OTOH, the problem with
most of these image spams is generally that there are very few tokens of any
significance.  Without the contribution of OCR-generated tokens, they tend
to score near 0.5 as a whole.  (Most of the tokens extracted from the decoy
text individually score near 0.5 and are discarded.)
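
To make the "pushed out of consideration" effect concrete, here is a rough
sketch of the selection step, not the actual SpamBayes code.  The function
name, the sample tokens and their scores are made up for illustration, and
the 0.1 cutoff for "near 0.5" is an assumption (I believe it matches the
default minimum_prob_strength option, alongside max_discriminators = 150).

```python
# Sketch only: constants are assumed defaults, names are hypothetical.
MAX_DISCRIMINATORS = 150
MINIMUM_PROB_STRENGTH = 0.1


def select_clues(token_probs, max_discriminators=MAX_DISCRIMINATORS,
                 min_strength=MINIMUM_PROB_STRENGTH):
    """Return the (token, prob) pairs that would take part in scoring."""
    # Drop bland tokens whose probability is too close to 0.5.
    strong = [(tok, p) for tok, p in token_probs.items()
              if abs(p - 0.5) >= min_strength]
    # Most extreme first, then keep only the top max_discriminators.
    strong.sort(key=lambda item: abs(item[1] - 0.5), reverse=True)
    return strong[:max_discriminators]


# A message already at the limit: 150 mildly hammy tokens.
message = {"word%d" % i: 0.35 for i in range(150)}
# Decoy text: bland tokens near 0.5, which never make the cut.
message.update({"decoy%d" % i: 0.52 for i in range(40)})
before = select_clues(message)

# Suppose OCR finds three strongly spammy tokens in the image.
message.update({"ocr:viagra": 0.99, "ocr:stock": 0.97, "ocr:buy": 0.95})
after = select_clues(message)

# Still 150 clues, but three of the original tokens have been displaced.
print(len(before), len(after))
print([tok for tok, _ in after[:3]])
```

With a message well under the limit nothing is displaced; the OCR tokens
are simply added to the clue list.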

The only way to tell for sure is to examine the tokens generated and their
scores to see what is contributing to the overall classification.
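
As a rough illustration of that kind of examination, the sketch below lists
each token with its score and whether it would count toward the result.
The helper name, the sample scores and the 0.1 cutoff are assumptions for
the example; in practice you would feed it the tokens and scores your own
classifier reports.

```python
def clue_report(token_probs, min_strength=0.1):
    """Return one report line per token, spammiest tokens first."""
    lines = []
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    for tok, p in ranked:
        if abs(p - 0.5) < min_strength:
            verdict = "ignored"      # too close to 0.5 to matter
        elif p > 0.5:
            verdict = "spam clue"
        else:
            verdict = "ham clue"
        lines.append("%-16s %.3f  %s" % (tok, p, verdict))
    return lines


# Hypothetical scores for three tokens from one message.
sample = {"ocr:viagra": 0.99, "weather": 0.52, "skip": 0.03}
report = clue_report(sample)
for line in report:
    print(line)
```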

Skip


