[Spambayes] how spambayes handles image-only spams
Bill Yerazunis
wsy at merl.com
Thu Sep 4 08:12:24 EDT 2003
From: "Ryan Malayter" <rmalayter at bai.org>
> From: Tim Peters [mailto:tim.one at comcast.net]
> [...]
> > The other side to this is that *any* evidence of HTML
> > is a strong spam indicator in most corpora... virtually
> > nothing using HTML could avoid being classified as spam...
>
> This doesn't seem right to me, at least on an intuitive level. We're an
> Outlook 2003 shop, and we've used Windows Group Policies to force all
> internal users to create HTML messages instead of Microsoft RTF format.
> So a great big heaping pile of my non-spam corpus would be messages that
> contain <P> <BR> and other "innocent" HTML tags. Shouldn't the
> statistical nature of SpamBayes give these tokens something near 0.5 as
> a score, since they appear frequently in both corpora?

No, my corpora agree with Tim Peters - spammers use HTML far more
often than "normal" users.
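
To make the "near 0.5" question concrete, here is a rough sketch of a
Robinson-style per-token spam probability. This is not the actual SpamBayes
code: the function name, the parameters, and the smoothing constants
(s=0.45, x=0.5) are assumptions picked for illustration, and the counts in
the usage lines are made up.

```python
def token_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Estimate how spammy a token is, on a 0.0 (ham) to 1.0 (spam) scale.

    spam_count / ham_count: number of spam / ham messages containing the token.
    nspam / nham: total number of spam / ham messages trained on.
    s, x: assumed smoothing strength and prior for rarely seen tokens.
    """
    if nspam <= 0 or nham <= 0:
        raise ValueError("need at least one trained message of each kind")

    # Compare per-corpus frequencies, so unequal corpus sizes don't skew things.
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    total = spam_ratio + ham_ratio
    if total == 0:
        return x  # never-seen token: fall back to the prior

    p = spam_ratio / total

    # Pull the raw estimate toward the prior x when there is little evidence.
    n = spam_count + ham_count
    return (s * x + n * p) / (s + n)


# A tag seen equally often in both corpora scores 0.5 (Ryan's case).
balanced = token_spam_prob(500, 500, 1000, 1000)

# A tag seen in 90% of spam but only 5% of ham scores close to 1.0.
skewed = token_spam_prob(900, 50, 1000, 1000)
```

So a token like <P> only lands near 0.5 if it really does show up at about
the same rate in both corpora. Whether that holds depends entirely on the
mail you train on.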
Statistically speaking, HTML mail is either from a spammer or from
a clueless git, and in either case can usually be delayed without
penalty or discarded outright.
Similarly, base64-encoded messages are almost _always_ trash.
-Bill Yerazunis