[Spambayes] how spambayes handles image-only spams

Ryan Malayter rmalayter at bai.org
Mon Sep 8 15:56:12 EDT 2003

From: Skip Montanaro [mailto:skip at pobox.com] 
> Do you have any evidence which suggests 
> that SpamBayes is not properly scoring 
> your mail?  

Nothing but a few "image-only" spams that slipped through with scores
in the 60% range.

> What HTML tokens are kept?  Which are 
> thrown out?  As far as I know, all are
> discarded, though URLs are checked.

I consider URLs to basically be HTML tags, since they often come from
the inside of an HREF or IMG tag. Even if it's a plain-text URL, that
serves the same function as an HREF tag, so it should be handled the
same, right?

> If most of the mail you get containing 
> <img> tags is spam, I suspect url:gif and 
> url:jpg are spammy for you as well.

As you suspected, they are fairly, but not overwhelmingly, spammy:
'url:gif'     0.767005          775    526
'url:jpg'     0.798242          398    325
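
For anyone wondering where tokens like url:gif come from: one plausible
way is to break each URL into its alphanumeric pieces and prefix every
piece with "url:". The sketch below is only my illustration of that
idea, not the actual SpamBayes tokenizer; the function name url_tokens
and the regular expressions are my own assumptions.

```python
import re

# Hypothetical pattern: grabs http/https URLs up to whitespace, a quote,
# or an angle bracket. This is an assumption, not SpamBayes's own regex.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

def url_tokens(text):
    """Return sorted 'url:<piece>' tokens for every URL found in text."""
    tokens = set()
    for url in URL_RE.findall(text):
        # Drop the scheme, then split the rest on non-alphanumeric runs.
        rest = url.split("://", 1)[1]
        for piece in re.split(r"[^A-Za-z0-9]+", rest):
            if piece:
                tokens.add("url:" + piece.lower())
    return sorted(tokens)

if __name__ == "__main__":
    print(url_tokens('<img src="http://ads.example.com/pics/Offer.GIF">'))
```

Under this scheme an image link contributes a url:gif or url:jpg token
whether it sits inside an IMG tag or appears as plain text.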

> I think SpamBayes is extracting just about 
> all the useful content it can from the message 
> already, even from the <img> tags. Adding an 
> html:img token probably wouldn't change the way 
> any given message scores (it wouldn't be much 
> spammier than url:gif or url:jpg).  

Most, but not all, of the useful content. For instance, in my corpora,
HTML comments are only present in spam. Wouldn't an html:comment token
provide more discriminating information than skipping the token
altogether? Also, looking for a FONT tag with COLOR=white would help
discriminate the "image-only" spams I see. The only reason I raised this
issue is that a few "image-only" spams with white-on-white random text
have gotten past my SpamBayes (007 Plug-in) filter.
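
To make the idea concrete, here is a minimal sketch of the kind of
extra tokens I have in mind. It uses Python's standard html.parser
module and is not tied to the SpamBayes tokenizer; the class name
TagTokenizer, the function html_tokens, and the token names
html:comment, html:font-white and html:img are all hypothetical.

```python
from html.parser import HTMLParser

class TagTokenizer(HTMLParser):
    """Collects hypothetical 'html:*' tokens; not part of SpamBayes."""

    # Colour values treated as "white" (an assumption; extend as needed).
    WHITE = ("white", "#ffffff", "#fff")

    def __init__(self):
        super().__init__()
        self.tokens = set()

    def handle_comment(self, data):
        # Any HTML comment at all yields a single token.
        self.tokens.add("html:comment")

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "font":
            color = dict(attrs).get("color") or ""
            if color.strip().lower() in self.WHITE:
                self.tokens.add("html:font-white")
        elif tag == "img":
            self.tokens.add("html:img")

def html_tokens(html_text):
    """Return the sorted set of extra tokens found in html_text."""
    parser = TagTokenizer()
    parser.feed(html_text)
    parser.close()
    return sorted(parser.tokens)

if __name__ == "__main__":
    sample = '<!-- x --><FONT COLOR="White">abc</FONT><img src="a.gif">'
    print(html_tokens(sample))
```

Each token is emitted at most once per message, so a spam stuffed with
comments counts the same as one containing a single comment. Whether
these tokens actually improve scoring is exactly what would need
testing.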

That said, I'm getting a 96.3% capture rate now, with zero false
positives to date. I'm very happy with SpamBayes; I just want to help
make it a little bit better if possible. My intuition tells me that
ignoring HTML tags means ignoring useful content, but I could be
totally wrong.

I'm going to figure out how to add these tokens to a customized parser
on my own, and report on the results. I'll see if they help at all.


