[Spambayes] how spambayes handles image-only spams
rmalayter at bai.org
Mon Sep 8 12:52:25 EDT 2003
From: Bill Yerazunis [mailto:wsy at merl.com]
> Um.... you're arguing politics of desire
> against actual measured statistics.
Not really; I have no stake in the prevalence of HTML mail. I just
think that corpora with small amounts of HTML ham are not representative
of the general, Windows-using email population. I also think the
trend toward "more HTML ham" will continue, because of the default
configurations of popular mail clients.
Given that wonderful folks like you actually write these filters for
the Internet community, I am simply concerned that some harmful design
decisions were made because your ham corpora are so devoid of HTML.
> on the grounds that the SpamAssassin corpus
> is a little less biased, I re-ran the tests
> So, it seems that "font" is somewhat spammy,
> and so is "br", but <a and <td aren't, and
> <p> is totally equivocal.
This is what I was getting at. Here are results from the most recent
1549 messages in each of my own corpora, which are probably biased
towards HTML ham:
          ham    ham %    spam   spam %
<P>       953    61.5%    1022    66.0%
<BR>     1223    79.0%    1009    65.1%
<TD        67     4.3%     425    27.4%
<font    1250    80.7%    1039    67.1%
<img       53     3.4%     817    52.7%
Total    1549             1549
As you can see, because so many people use Outlook, Outlook Express,
and Notes to send me ham, HTML tags are present in a large share of
what I receive. (The exceptions are <TD, which seems to appear in ham
only when someone sends me excerpts from a spreadsheet, and <img,
which appears only when people send me photos or joke images.)
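
To make the table easier to read as evidence, here is a minimal Python
sketch that turns each row into a rough "fraction of evidence pointing
to spam". It is a simplified frequency ratio, offered as an assumption
for illustration only; it is not claimed to be the score SpamBayes
actually computes. The names COUNTS and spam_ratio are hypothetical.

```python
# Message counts per tag, copied from the table above: (ham, spam).
COUNTS = {
    "<P>": (953, 1022),
    "<BR>": (1223, 1009),
    "<TD": (67, 425),
    "<font": (1250, 1039),
    "<img": (53, 817),
}
N_HAM = N_SPAM = 1549  # corpus sizes from the "Total" row


def spam_ratio(ham_count, spam_count, n_ham=N_HAM, n_spam=N_SPAM):
    """Simplified estimate: spam frequency over (ham + spam frequency).

    Values near 1.0 suggest the tag is spammy, near 0.0 hammy,
    and near 0.5 uninformative. Assumes at least one count is nonzero.
    """
    ham_freq = ham_count / n_ham
    spam_freq = spam_count / n_spam
    return spam_freq / (ham_freq + spam_freq)


if __name__ == "__main__":
    for tag, (ham, spam) in COUNTS.items():
        print(f"{tag:<6} {spam_ratio(ham, spam):.3f}")
```

On these counts, <img and <TD come out strongly spammy, while <P>,
<BR> and <font sit close to the middle.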
My basic argument is that arbitrarily throwing out some HTML tokens in
the parser, while leaving others, might make the filter more effective
for only certain corpora. What test corpora was this decision based on?
I think keeping some form of <img as a token would help my
detection of image-only spam, which seems to slip through SpamBayes more
often than other types of spam. I also think it would be even better to
have a multi-word token, something like those produced by the CRM-114
token generator, which could find multi-tag strings like <img*src*http.
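
A rough sketch of what such a composite token might look like in
Python follows. It is only an illustration of the idea: it is neither
the CRM-114 token generator nor the SpamBayes tokenizer, and the names
IMG_RE, SRC_RE and img_tokens are hypothetical. It emits a plain <img
token for each image tag, plus a <img*src*SCHEME token built from the
URL scheme of the src attribute.

```python
import re

# Match a whole <img ...> tag, case-insensitively.
IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# Capture the value of the src attribute, quoted or unquoted.
SRC_RE = re.compile(r"""\bsrc\s*=\s*["']?([^"'\s>]+)""", re.IGNORECASE)


def img_tokens(html):
    """Return simple and composite tokens for every <img> tag in html."""
    tokens = []
    for tag in IMG_RE.findall(html):
        tokens.append("<img")
        match = SRC_RE.search(tag)
        if match:
            value = match.group(1)
            # Use the URL scheme (http, cid, ...) or mark the link as relative.
            if ":" in value:
                scheme = value.split(":", 1)[0].lower()
            else:
                scheme = "relative"
            tokens.append(f"<img*src*{scheme}")
    return tokens


if __name__ == "__main__":
    print(img_tokens('<IMG SRC="http://example.com/a.gif">'))
```

The thought behind keeping the scheme is that a remotely hosted image
(http) and an attached photo (for example cid) would end up as
different tokens, so the classifier could learn them separately.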
These suggestions are based only on my knowledge of the algorithms
involved and the contents of my corpora. I don't know enough Python to
really give them a try in SpamBayes (although I'm working on that ;-).