[Spambayes] how spambayes handles image-only spams
Tim Peters
tim.one at comcast.net
Mon Sep 1 20:48:22 EDT 2003
[Ryan Malayter]
> Image-only spams seem to be the only things that really give my
> trained spambayes 007 plugin trouble.
"trouble" means what? That they're classified as ham, or that they're
classified as unsure?
> Many of these have white-on-white garbage text
Then they're not image-only <wink>.
> designed to fool simple "cumulative weighting" filters (which they do
> very well). Spambayes seems to have trouble with them because they have
> so little information, and with spoofed senders and ever-changing
> domains, there is not much in the headers or URLs to score with
> statistical significance.
That would be consistent with trouble meaning "classified as unsure", but
not with trouble meaning "classified as ham".
> When looking at the score reports for various image-only spams, I see
> that tokens url:gif and url:jpeg
> get high scores. But I noticed that "<IMG... SRC=http", which would
> indicate a hosted image link, is not represented as a special token.
That's true. All HTML decorations are ignored unless/until someone adds
code specifically looking for one. We've added several of those over the
months, but still ignore "almost all" HTML decoration.
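
A minimal sketch of what "code specifically looking for one" could look like
for the hosted-image case. This is not code from the spambayes tokenizer: the
function name img_src_tokens and the token string "html:img-src-http" are
hypothetical, and a regex is only a rough stand-in for real HTML parsing.

```python
import re

# Hypothetical pattern: an <img ...> tag whose src attribute points at a
# remote http(s) URL. [^>]*? lets other attributes, whitespace and newlines
# sit between IMG and SRC, which spammers often insert.
IMG_SRC_RE = re.compile(
    r"<\s*img\b[^>]*?\bsrc\s*=\s*[\"']?\s*https?://",
    re.IGNORECASE,
)

def img_src_tokens(html):
    """Yield one synthetic token if the text contains a remotely hosted image."""
    if IMG_SRC_RE.search(html):
        # Token name is an assumption made for this sketch.
        yield "html:img-src-http"
```

Emitting at most one token per message, however many images it contains,
keeps this from turning into many correlated clues about the same fact.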
> Nor any COLOR=<special indicator that color is near background color>
> tokens.
Also true.
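
One way such a COLOR token might be generated, as a rough sketch only. The
names color_tokens and "html:color-near-bg", the default white background,
and the distance threshold are all assumptions made for illustration; the
sketch handles only #rrggbb values in bgcolor= and color= attributes and
ignores named colors and CSS.

```python
import re

# Matches bgcolor="#rrggbb" or color="#rrggbb" (quotes optional).
HEX_ATTR_RE = re.compile(
    r"\b(bgcolor|color)\s*=\s*[\"']?#([0-9a-f]{6})\b",
    re.IGNORECASE,
)

def _rgb(hex6):
    """Convert 'rrggbb' to an (r, g, b) tuple of ints."""
    return tuple(int(hex6[i:i + 2], 16) for i in (0, 2, 4))

def color_tokens(html, max_distance=48):
    """Yield one token if any text color is close to the background color."""
    background = (255, 255, 255)  # assumption: white unless bgcolor says otherwise
    foregrounds = []
    for name, value in HEX_ATTR_RE.findall(html):
        if name.lower() == "bgcolor":
            background = _rgb(value)
        else:
            foregrounds.append(_rgb(value))
    for fg in foregrounds:
        # Sum of per-channel differences; the threshold is a guess to be tuned.
        if sum(abs(a - b) for a, b in zip(fg, background)) <= max_distance:
            yield "html:color-near-bg"
            return
```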
> Only segments of the URL itself show up. Spammers often seem to do
> quite a bit of work to separate the IMG and SRC tags, so it might
> take a little extra smarts in the tokenizer to make sure it gets
> done right.
>
> It would seem to me that these IMG and COLOR tokens would be fairly
> strong spam indicators, at least with my corpus.
The other side to this is that *any* evidence of HTML is a strong spam
indicator in most corpora. For example, you'll find that "<p>" is a strong
spam indicator, if you make the tokenizer produce it. Ditto "<br>". That's
because such a high percentage of spam uses HTML. Early testing showed that
tokenizing all HTML decorations produced classifiers so overwhelmed by
hundreds of correlated "it used HTML" clues that virtually nothing using
HTML could avoid being classified as spam -- even msgs talking *about* HTML
got classed as spam if they included an example. That's why spambayes
backed off to ignoring (at first) all HTML decorations. I think it's fine
to add back (and have added) specific HTML decorations that are usually
unique to spam.
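
The swamping effect can be illustrated with a toy calculation. This uses a
plain naive-Bayes log-odds sum, which is a simplification chosen for the
example and not the combining scheme spambayes actually uses; the function
name and all the probabilities are made up.

```python
import math

def naive_combine(probs):
    """Combine per-token spam probabilities by summing their log-odds."""
    log_odds = sum(math.log(p / (1.0 - p)) for p in probs)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Made-up numbers: five strongly hammy words, thirty mildly spammy HTML tags.
ham_words = [0.05] * 5
html_tags = [0.90] * 30

without_tags = naive_combine(ham_words)              # very close to 0 (ham)
with_tags = naive_combine(ham_words + html_tags)     # very close to 1 (spam)
```

Each tag token is only mild evidence on its own, but thirty of them all
restate the single fact "this message uses HTML", and together they outvote
the genuinely informative words.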
> I think they might provide more information to incriminate the messages
> that have little statistically to score anyway. Has anyone tried these as
> special tokens? If so, what were the results?
I haven't tried it, although I've often intended to add a COLOR token to see
what happens. A problem is that I don't seem to get much spam of this sort,
and the stuff of this kind I get is usually classified as spam anyway (my
spam and ham cutoffs are 80 and 20, btw). Work up a patch and see what
happens!
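
For reference, the ham/unsure/spam split with those cutoffs can be sketched as
below. The function name and the handling of scores exactly at a cutoff are
assumptions for this example, with scores expressed on a 0-100 scale.

```python
def classify(score, ham_cutoff=20.0, spam_cutoff=80.0):
    """Map a 0-100 spam score to 'ham', 'unsure' or 'spam'."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    # Anything between the two cutoffs is left for a human to decide.
    return "unsure"
```

A message with little usable evidence tends to land between the cutoffs,
which is the "classified as unsure" case rather than "classified as ham".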