[Spambayes] how spambayes handles image-only spams

Bill Yerazunis wsy at merl.com
Thu Sep 4 08:12:24 EDT 2003


   From: "Ryan Malayter" <rmalayter at bai.org>

   From: Tim Peters [mailto:tim.one at comcast.net] 

   [...]

   > The other side to this is that *any* evidence of HTML 
   > is a strong spam indicator in most corpora...  virtually 
   > nothing using HTML could avoid being classified as spam...

   This doesn't seem right to me, at least on an intuitive level. We're an
   Outlook 2003 shop, and we've used Windows Group Policies to force all
   internal users to create HTML messages instead of Microsoft RTF format.
   So a great big heaping pile of my non-spam corpus would be messages that
   contain <P> <BR> and other "innocent" HTML tags. Shouldn't the
   statistical nature of SpamBayes give these tokens something near 0.5 as
   a score, since they appear frequently in both corpora?

No, my corpora agree with Tim Peters - spammers use HTML far more
often than "normal" users.
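
For anyone who wants to see why the ratio between the two corpora matters, here is a rough sketch of a per-token score. It is not the actual SpamBayes code: the function name token_spam_prob, the counts, and the smoothing values s and x are all made up for illustration, and the smoothing is only assumed to be Robinson-style. A token seen equally often in spam and ham lands at 0.5, as Ryan expects; a token seen mostly in spam lands close to 1.

```python
def token_spam_prob(spam_count, ham_count, n_spam, n_ham, s=0.45, x=0.5):
    """Smoothed estimate of P(spam | token) from per-corpus counts.

    spam_count, ham_count: messages in each corpus containing the token.
    n_spam, n_ham: total messages in each corpus.
    s, x: illustrative smoothing strength and prior (assumed values).
    """
    spam_ratio = spam_count / n_spam
    ham_ratio = ham_count / n_ham
    if spam_ratio + ham_ratio == 0:
        # Token never seen in either corpus: fall back to the prior.
        return x
    # Raw estimate from relative frequencies in the two corpora.
    p = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate toward the prior x when evidence is thin.
    n = spam_count + ham_count
    return (s * x + n * p) / (s + n)


# Hypothetical corpora of 1000 spam and 1000 ham messages.
balanced = token_spam_prob(900, 900, 1000, 1000)  # tag common in both
skewed = token_spam_prob(900, 50, 1000, 1000)     # tag mostly in spam

print(balanced, skewed)
```

So the score for a tag like <P> depends entirely on what the training corpora look like: it is neutral only when HTML is about as common in ham as in spam.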

Statistically speaking, HTML mail is either from a spammer or from 
a clueless git, and in either case can usually be delayed without 
penalty or discarded outright.

Similarly, base64-encoded messages are almost _always_ trash.

	   -Bill Yerazunis


