[Spambayes] how spambayes handles image-only spams
Ryan Malayter
rmalayter at bai.org
Mon Sep 8 15:56:12 EDT 2003
From: Skip Montanaro [mailto:skip at pobox.com]
> Do you have any evidence which suggests
> that SpamBayes is not properly scoring
> your mail?
Nothing but a few spams of the "image-only" type that slipped through
with scores in the 60% range.
> What HTML tokens are kept? Which are
> thrown out? As far as I know, all are
> discarded, though URLs are checked.
I consider URLs to basically be HTML tags, since they often come from
the inside of an HREF or IMG tag. Even if it's a plain-text URL, that
serves the same function as an HREF tag, so it should be handled the
same, right?
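
To illustrate what I mean by handling both cases the same way, here is a
minimal sketch. It is not the SpamBayes tokenizer; the function name
url_tokens, the regular expressions, and the "split on non-alphanumerics"
rule are all assumptions made for this example, loosely modeled on tokens
such as url:gif. Because the regex runs over the raw body, a URL inside an
HREF or IMG attribute and a plain-text URL yield identical tokens.

```python
import re

# Assumption for this sketch: a URL is "http://" or "https://" followed by
# anything up to whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

# Assumption: URL pieces are separated by runs of non-alphanumeric characters.
SPLIT_RE = re.compile(r"[^a-z0-9]+")


def url_tokens(body):
    """Return sorted 'url:<piece>' tokens for every URL found in body."""
    tokens = set()
    for url in URL_RE.findall(body):
        for piece in SPLIT_RE.split(url.lower()):
            if piece:  # skip empty strings produced by the split
                tokens.add("url:" + piece)
    return sorted(tokens)


html_body = '<a href="http://example.com/offer"><img src="http://example.com/pic.gif"></a>'
text_body = "Visit http://example.com/offer and http://example.com/pic.gif today"

# The same URLs produce the same tokens whether or not they sit inside a tag.
print(url_tokens(html_body) == url_tokens(text_body))
```
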
> If most of the mail you get containing
> <img> tags is spam, I suspect url:gif and
> url:jpg are spammy for you as well.
As you suspected, they are fairly, but not overwhelmingly, spammy:
'url:gif' 0.767005 775 526
'url:jpg' 0.798242 398 325
> I think SpamBayes is extracting just about
> all the useful content it can from the message
> already, even from the <img> tags. Adding an
> html:img token probably wouldn't change the way
> any given message scores (it wouldn't be much
> spammier than url:gif or url:jpg).
Most, but not all, of the useful content. For instance, in my corpora,
HTML comments are only present in spam. Wouldn't an html:comment token
provide more discriminating information than skipping the token
altogether? Also, looking for a FONT tag with a COLOR=<white> would help
discriminate the "image-only" spams I see. The only reason I raised this
issue is that a few "image-only" spams with white-on-white random text have
gotten by my SpamBayes (007 Plug-in) filter.
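
A rough sketch of the two extra tokens I have in mind, using Python's
standard html.parser module. This is not how SpamBayes tokenizes mail. The
class name StructureTokenizer, the helper structure_tokens, the token name
html:font-color-white, and the set of values treated as "white" are all
hypothetical choices for illustration.

```python
from html.parser import HTMLParser

# Assumption: these attribute values count as "white" for this sketch.
WHITE_VALUES = {"white", "#fff", "#ffffff"}


class StructureTokenizer(HTMLParser):
    """Collects hypothetical structural tokens from an HTML body."""

    def __init__(self):
        super().__init__()
        self.tokens = set()

    def handle_comment(self, data):
        # Any HTML comment yields the token; the set keeps it to one copy.
        self.tokens.add("html:comment")

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag != "font":
            return
        for name, value in attrs:
            if name == "color" and value and value.strip().lower() in WHITE_VALUES:
                self.tokens.add("html:font-color-white")


def structure_tokens(html_text):
    """Return the sorted structural tokens found in html_text."""
    parser = StructureTokenizer()
    parser.feed(html_text)
    parser.close()
    return sorted(parser.tokens)


sample = (
    '<html><body><!-- x9f3 --><font color="#FFFFFF">random words</font>'
    '<img src="http://example.com/a.gif"></body></html>'
)
print(structure_tokens(sample))
```

Whether such tokens actually improve scoring would have to be tested
against a real corpus; the sketch only shows that they are cheap to
generate.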
That said, I'm getting a 96.3% capture rate now, with zero false
positives to date. I'm very happy with SpamBayes; I just want to help
make it a little bit better if possible. My intuition tells me ignoring
HTML tags is ignoring useful content, but I could be totally wrong.
I'm going to figure out how to add these tokens to a customized parser
on my own, see whether they help at all, and report the results.
Regards,
-Ryan-