[Spambayes] how spambayes handles image-only spams

Mon Sep 1 03:20:42 EDT 2003

Image-only spams seem to be the only things that really give my trained
spambayes 007 plugin trouble. Many of these have white-on-white garbage
text designed to fool simple "cumulative weighting" filters (which they
do very well). Spambayes seems to have trouble with them because they
have so little information, and with spoofed senders and ever-changing
domains, there is not much in the headers or URLs to score with
statistical significance.

When looking at the score reports for various image-only spams, I see
that the tokens url:gif and url:jpeg get high scores. But I noticed that
"<IMG... SRC=http", which would indicate a hosted image link, is not
represented as a special token. Nor are there any
COLOR=<special indicator that color is near background color> tokens.
Only segments of the URL itself show up. Spammers often seem to do quite
a bit of work to separate the IMG tag from its SRC attribute, so it
might take a little extra smarts in the tokenizer to make sure it gets
done right.

It would seem to me that these IMG and COLOR tokens would be fairly
strong spam indicators, at least with my corpus. I think they might
provide more information to incriminate messages that otherwise have
little of statistical value to score. Has anyone tried these as special
tokens? If so, what were the results?
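
To make the idea concrete, here is a rough sketch of the kind of tokens
I mean. It is not Spambayes tokenizer code; it only uses Python's
standard html.parser module. The token names img:remote-src and
color:near-background, the NEAR threshold, the assumption of a white
default background, and the class and function names are all made up
for illustration.

```python
from html.parser import HTMLParser


def _parse_hex_color(value):
    """Return (r, g, b) for '#rrggbb' or 'rrggbb', else None."""
    value = value.strip().lstrip("#")
    if len(value) != 6:
        return None
    try:
        return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))
    except ValueError:
        return None


class ImageColorTokenizer(HTMLParser):
    """Collects hypothetical special tokens from an HTML message body."""

    NEAR = 48  # assumed per-channel distance that counts as "near"

    def __init__(self):
        super().__init__()
        self.tokens = []
        # Assumption: background is white unless BODY BGCOLOR says otherwise.
        self.background = (255, 255, 255)

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, and copes with
        # attributes spread over several lines or given in any order.
        attrs = dict(attrs)
        if tag == "body":
            bg = _parse_hex_color(attrs.get("bgcolor") or "")
            if bg is not None:
                self.background = bg
        elif tag == "img":
            src = (attrs.get("src") or "").strip().lower()
            if src.startswith(("http://", "https://")):
                self.tokens.append("img:remote-src")
        elif tag == "font":
            fg = _parse_hex_color(attrs.get("color") or "")
            if fg is not None and all(
                abs(a - b) <= self.NEAR for a, b in zip(fg, self.background)
            ):
                self.tokens.append("color:near-background")


def special_tokens(html):
    """Return the list of special tokens found in an HTML string."""
    parser = ImageColorTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

This only looks at hex colors on FONT tags and a BGCOLOR on BODY; named
colors, style attributes, and table cell backgrounds would need more
work. Whether such tokens actually help would have to be checked by
testing against a real corpus.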

Thanks for your help,
    -ryan-