[Spambayes] how spambayes handles image-only spams

G. Armour Van Horn vanhorn at whidbey.com
Mon Sep 8 12:42:52 EDT 2003

Another datum:

When I took over the Quotes of the Day project in 2000, the daily mailings
had always been text only. Being a traditional guy, I continued that, but
wanting to sell some advertising, I added an HTML version.

In mid-2000, I think that subscribers were choosing HTML and text in roughly
equal numbers, but this was at a time when AOL users couldn't read HTML mail
reliably, and my signup page clearly warned AOL users not to choose HTML.

In mid-2003, the proportion choosing HTML is at least 90%; I can go days at
a stretch without seeing a single new text subscriber. Every issue includes
a mailto: link to switch from one format to the other, but I rarely see this
option used anymore.

In the most recent mailing, 60.4% of the subscribers had elected to receive

I certainly don't consider this proof, and I don't dispute that there are
traditionalists who still live in a text-dominated environment, but it's
pretty clear to me that the world as a whole has overwhelmingly chosen to go
with HTML in e-mail.

Being at least a little traditional myself, I intend to keep the text
version of my mailing available as long as any subscriber wants it, even if
I think that my HTML version is very tasteful and actually more readable
than the text.


Ryan Malayter wrote:

> From: Bill Yerazunis [mailto:wsy at merl.com]
> > Um.... you're arguing politics of desire
> > against actual measured statistics.
> Not really; I have no stake in the prevalence of HTML mail. I just
> think that corpora with small amounts of HTML ham are not representative
> of the general, Windows-using email population. And I also think the
> trend of "more HTML ham" will continue, because of the default
> configurations of popular mail clients.
> Given the fact that wonderful folks like you actually write these filters for
> the Internet community, I am simply concerned that some harmful design
> decisions were made because your ham corpora are so devoid of HTML.
> > on the grounds that the SpamAssassin corpus
> > is a little less biased, I re-ran the tests
> ...
> > So, it seems that "font" is somewhat spammy,
> > and so is "br", but <a and <td aren't, and
> > <p> is totally equivocal.
> This is what I was getting at. Here are results from the most recent
> 1549 messages of each of my own corpora, which are probably biased
> towards HTML ham:
>         ham     ham %   spam    spam %
> <P>     953     61.5%   1022    66.0%
> <BR>    1223    79.0%   1009    65.1%
> <TD     67      4.3%    425     27.4%
> <font   1250    80.7%   1039    67.1%
> <img    53      3.4%    817     52.7%
> Total   1549            1549
> As you can see, because so many people use Outlook, Outlook Express,
> and Notes to send me ham, HTML tags are present in a great amount of
> what I receive. (Except of course for <TD, which only seems to be ham
> when someone is sending excerpts from a spreadsheet to me, and <img,
> which is only used when people send me photos or joke images.)
> My basic argument is that arbitrarily throwing out some HTML tokens in
> the parser, while leaving others, might make the filter more effective
> for only certain corpora. What test corpora was this decision based on?
> I think keeping some form of <img as tokens would help my
> detection of image-only spam, which seems to slip through SpamBayes more
> often than other types of spam. I also think it would be even better to
> have a multi-word token something like that produced by the CRM-114
> token generator, which could find multi-tag strings like <img*src*http.
> These suggestions are just based on my knowledge of the algorithms
> involved and the contents of my corpora; I don't know enough Python to
> really give them a try in SpamBayes (although I'm working on that ;-).
> Regards,
>         -Ryan-
> _______________________________________________
> Spambayes at python.org
> http://mail.python.org/mailman/listinfo/spambayes
> Check the FAQ before asking: http://spambayes.sf.net/faq.html
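
For anyone who wants to try the multi-tag token idea Ryan describes, here is
a minimal sketch using only the Python standard library. It is not SpamBayes
code and does not reflect how the SpamBayes tokenizer works. The names
TagTokenizer, WATCHED and tag_tokens are hypothetical, and the choice of
which tag/attribute pairs to watch is an assumption made for illustration.
It emits one token per opening tag, plus a combined token such as
<img*src*http when a watched attribute carries a URL.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class TagTokenizer(HTMLParser):
    """Collect tag tokens and combined tokens such as '<img*src*http'."""

    # Assumption: only these (tag, attribute) pairs get a combined token.
    WATCHED = {("img", "src"), ("a", "href")}

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        self.tokens.append("<" + tag)
        for name, value in attrs:
            if (tag, name) not in self.WATCHED or not value:
                continue
            try:
                scheme = urlsplit(value).scheme.lower() or "relative"
            except ValueError:
                # urlsplit rejects some malformed URLs.
                scheme = "invalid"
            self.tokens.append("<%s*%s*%s" % (tag, name, scheme))


def tag_tokens(html_text):
    """Return the list of tag tokens found in one HTML message body."""
    parser = TagTokenizer()
    parser.feed(html_text)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    body = '<html><body><IMG SRC="http://example.com/a.gif"></body></html>'
    print(tag_tokens(body))
```

Counting how many ham and spam messages contain each token, as in Ryan's
table, would then show whether the combined token separates the two corpora
better than a bare <img does.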

