[Spambayes] how spambayes handles image-only spams
Robert K. Coe
bob at 1776.com
Sun Sep 7 20:04:31 EDT 2003
I wonder if you may be overlooking something that could skew your statistics. My experience has been that when I create an HTML message, Outlook actually sends it as a multi-part MIME construct incorporating both HTML and plain-text forms of the message. If the recipient reads the message with an HTML-capable email reader, he'll see the HTML form of the message; otherwise he'll see the plain-text form. If you're collecting your statistics with a plain-text mail reader, or if you're looking only at the plain-text version in a multi-part message, you may be understating the actual use of HTML in messages sent to you.
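One way to avoid that undercount is to walk every MIME part of a message rather than looking only at the plain-text body. The following is a minimal sketch using Python's standard `email` package; the helper names `content_types` and `uses_html` are hypothetical, and the sample message is built here purely for illustration (it is not real Outlook output).

```python
from email import message_from_string
from email.message import EmailMessage


def content_types(msg):
    """Return the set of content types of all non-container parts."""
    return {
        part.get_content_type()
        for part in msg.walk()
        if not part.is_multipart()
    }


def uses_html(msg):
    """True if any part of the message is text/html."""
    return "text/html" in content_types(msg)


# Build an illustrative multipart/alternative message carrying both a
# plain-text and an HTML rendering of the same body.
sample = EmailMessage()
sample["Subject"] = "example"
sample.set_content("Hello in plain text.")
sample.add_alternative("<p>Hello in HTML.</p>", subtype="html")

# Round-trip through the parser, as one would for mail read from disk.
parsed = message_from_string(sample.as_string())

print(parsed.get_content_type())
print(sorted(content_types(parsed)))
print(uses_html(parsed))
```

Counting a message as HTML whenever `uses_html` is true would catch the multipart case that a plain-text-only view misses.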
In fact, if someone knows how to get Outlook to stop sending a plain-text version of HTML messages, I'd like to hear about it. Now that almost everybody can read HTML messages, I think the plain-text version is superfluous.
MIS Department, City of Cambridge
831 Massachusetts Ave, Cambridge MA 02139 · 617-349-4217 · fax 617-349-6165
> -----Original Message-----
> From: Bill Yerazunis [mailto:wsy at merl.com]
> Sent: Saturday, September 06, 2003 10:29 AM
> To: rmalayter at bai.org
> Cc: spambayes at python.org
> Subject: Re: [Spambayes] how spambayes handles image-only spams
> From: "Ryan Malayter" <rmalayter at bai.org>
> > Statistically speaking, HTML mail is
> > either from a spammer or from a clueless
> > git, and in either case can usually be
> > delayed without penalty or discarded outright.
> As indicated above, I do not think this analysis is true anymore. And
> characterizing someone as a clueless git because they don't change their
> mail client's default message format or "love" plain text... Well, let
> us know when you get back to the real world.
> Um.... you're arguing politics of desire against actual measured statistics.
> In my current CRM114 corpus (which is running realtime and delivering
> better accuracy than I myself can deliver- well over 99.9%):
> SingleToken    Spam   Nonspam
> <p>              49         0
> <br>            207        32
> <td>             48         0
> <font            57         2
> <a              117         2
> Other HTML tokens have similar statistics. The margin of error on
> each of these (aliasing probability) is 1 - 1/2^64, in other words, a
> few billionths of a percent of a chance that this is due to aliasing
> in the database.
> E pur si moivre, dude. E pur si moivre.
> -Bill Yerazunis
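For readers who want the quoted token counts as proportions, here is a small sketch that computes, for each token, the fraction of its occurrences that fell in spam. This is plain arithmetic on the numbers quoted above, not the scoring method of CRM114 or Spambayes, and because the sizes of the spam and nonspam corpora are not given, the fractions are not normalized for corpus size. The name `spam_fraction` is hypothetical.

```python
# Token counts as quoted in the message: token -> (spam, nonspam).
counts = {
    "<p>": (49, 0),
    "<br>": (207, 32),
    "<td>": (48, 0),
    "<font": (57, 2),
    "<a": (117, 2),
}


def spam_fraction(spam, nonspam):
    """Fraction of a token's occurrences that were in spam, or None if unseen."""
    total = spam + nonspam
    if total == 0:
        return None
    return spam / total


for token, (spam, nonspam) in counts.items():
    frac = spam_fraction(spam, nonspam)
    print(f"{token:<8} {spam:>4} {nonspam:>4} {frac:.3f}")
```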