[Spambayes] how spambayes handles image-only spams

Robert K. Coe bob at 1776.com
Sun Sep 7 20:04:31 EDT 2003


I wonder if you may be overlooking something that could skew your statistics. My experience has been that when I create an HTML message, Outlook actually sends it as a multi-part MIME construct incorporating both HTML and plain-text forms of the message. If the recipient reads the message with an HTML-capable email reader, he'll see the HTML form of the message; otherwise he'll see the plain-text form. If you're collecting your statistics with a plain-text mail reader, or if you're looking only at the plain-text version in a multi-part message, you may be understating the actual use of HTML in messages sent to you.
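
To illustrate the point above, here is a minimal sketch using Python's standard email package. It builds a message carrying both a plain-text and an HTML form, then lists every leaf part, so the HTML alternative is counted even when a plain-text part is present. The helper names (build_alternative_message, content_types, has_html_part) and the addresses are made up for this example; it is not how Outlook, Spambayes, or CRM114 actually process mail.

```python
from email.message import EmailMessage


def build_alternative_message():
    """Build a message with plain-text and HTML forms (hypothetical sample)."""
    msg = EmailMessage()
    msg["Subject"] = "Example"
    msg["From"] = "sender@example.com"      # placeholder address
    msg["To"] = "recipient@example.com"     # placeholder address
    # The first body becomes the text/plain part.
    msg.set_content("Hello in plain text.")
    # Adding an alternative converts the message to multipart/alternative.
    msg.add_alternative(
        "<html><body><p>Hello in <b>HTML</b>.</p></body></html>",
        subtype="html",
    )
    return msg


def content_types(msg):
    """Return the content types of all leaf (non-container) parts."""
    return [
        part.get_content_type()
        for part in msg.walk()
        if not part.is_multipart()
    ]


def has_html_part(msg):
    """True if any leaf part is text/html, even alongside a text/plain part."""
    return "text/html" in content_types(msg)


if __name__ == "__main__":
    sample = build_alternative_message()
    print(sample.get_content_type())
    print(content_types(sample))
    print(has_html_part(sample))
```

A tally that looks only at the first text/plain part would count this message as plain text; walking all parts counts it as HTML mail as well.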

In fact, if someone knows how to get Outlook to stop sending a plain-text version of HTML messages, I'd like to hear about it. Now that almost everybody can read HTML messages, I think the plain-text version is superfluous.

Bob

MIS Department, City of Cambridge
831 Massachusetts Ave, Cambridge MA 02139  ·  617-349-4217  ·  fax 617-349-6165


> -----Original Message-----
> From: Bill Yerazunis [mailto:wsy at merl.com]
> Sent: Saturday, September 06, 2003 10:29 AM
> To: rmalayter at bai.org
> Cc: spambayes at python.org
> Subject: Re: [Spambayes] how spambayes handles image-only spams
> 
> 
> 
>    From: "Ryan Malayter" <rmalayter at bai.org>
> 
>    > Statistically speaking, HTML mail is 
>    > either from a spammer or from a clueless 
>    > git, and in either case can usually be 
>    > delayed without penalty or discarded outright.
> 
>    As indicated above, I do not think this analysis is true anymore. And
>    characterizing someone as a clueless git because they don't change their
>    mail client's default message format or "love" plain text... Well, let
>    us know when you get back to the real world. 
> 
> Um.... you're arguing politics of desire against actual measured
> statistics.  
> 
> In my current CRM114 corpus (which is running realtime and delivering
> better accuracy than I myself can deliver- well over 99.9%):
> 
> SingleToken  Spam  Nonspam
> 
> <p>            49        0
> <br>          207       32
> <td>           48        0
> <font          57        2
> <a            117        2
> 
> Other HTML tokens have similar statistics.  The margin of error on
> each of these (aliasing probability) is 1/2^64, in other words, far
> less than a billionth of a percent of a chance that this is due to
> aliasing in the database.
> 
> E pur si moivre, dude.  E pur si moivre.
> 
>   -Bill Yerazunis
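
As a rough reading of the token counts quoted above, the sketch below computes the raw share of each token's occurrences that fell in spam. The counts are copied from the quoted table; the names TOKEN_COUNTS and spam_share are hypothetical, and returning 0.5 for an unseen token is an assumption made for this example. This is plain arithmetic, not the scoring that CRM114 or Spambayes actually performs, and it ignores how many spam and nonspam messages the corpus holds, which the post does not state.

```python
# Counts from the quoted table: token -> (spam count, nonspam count)
TOKEN_COUNTS = {
    "<p>": (49, 0),
    "<br>": (207, 32),
    "<td>": (48, 0),
    "<font": (57, 2),
    "<a": (117, 2),
}


def spam_share(spam, nonspam):
    """Fraction of a token's occurrences that were in spam.

    Returns 0.5 when the token was never seen (an assumption for this
    sketch: no evidence either way).
    """
    total = spam + nonspam
    if total == 0:
        return 0.5
    return spam / total


if __name__ == "__main__":
    for token, (spam, nonspam) in TOKEN_COUNTS.items():
        share = spam_share(spam, nonspam)
        print(f"{token:<8} {spam:>5} {nonspam:>5}  {share:.3f}")
```

These shares are only meaningful relative to the sizes of the spam and nonspam corpora; a real classifier would normalize by those totals and combine evidence across many tokens.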



