[Spambayes] how spambayes handles image-only spams
garry at zvolve.com
Tue Sep 2 02:33:42 EDT 2003
On Mon, Sep 01, 2003 at 19:48:22 -0400, Tim Peters wrote:
> [Ryan Malayter]
> > It would seem to me that these IMG and COLOR tokens would be fairly
> > strong spam indicators, at least with my corpus.
> The other side to this is that *any* evidence of HTML is a strong spam
> indicator in most corpora. For example, you'll find that "<p>" is a strong
> spam indicator, if you make the tokenizer produce it. Ditto "<br>". That's
> because such a high percentage of spam uses HTML. Early testing showed that
> tokenizing all HTML decorations produced classifiers so overwhelmed by
> hundreds of correlated "it used HTML" clues that virtually nothing using
> HTML could avoid being classified as spam -- even msgs talking *about* HTML
> got classed as spam if they included an example. That's why spambayes
> backed off to ignoring (at first) all HTML decorations. I think it's fine
> to add back (and have added) specific HTML decorations that are usually
> unique to spam.
Yet Paul Graham thinks[*] the URL in the IMG tag will correlate
strongly with spam, because it points to the spammer's message. Why
not tokenize the value of the link?
[*] From http://www.paulgraham.com/sofar.html:
Sending the spam as an image instead of text doesn't work
either, because you need certain html tags to display an
image, and these all end up having very high spam
probabilities. Particularly the url. If you use a domain name
and it's one that has shown up in spams before, you're dead.
If you use an ip address instead, you're even deader. No
tokens have higher spam probabilities than numbers in a url.
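To make the question concrete, here is a minimal sketch of what
tokenizing the IMG URL could look like. It is not the spambayes
tokenizer: the names ImgSrcCollector and img_url_tokens and the
"img-url:" token prefixes are made up for illustration, and it only
uses the standard library's html.parser and urllib.parse.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def img_url_tokens(html_text):
    """Yield scheme, host and path tokens for each IMG URL (a sketch)."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    collector.close()
    for src in collector.sources:
        try:
            parts = urlsplit(src)
            host = (parts.hostname or "").lower()
        except ValueError:
            # Malformed URL (e.g. a broken IPv6 literal): skip it.
            continue
        if parts.scheme:
            yield "img-url:scheme:" + parts.scheme.lower()
        # Split the host on dots, so each octet of an IP address
        # becomes its own numeric token.
        for piece in host.split("."):
            if piece:
                yield "img-url:host:" + piece
        # Split the path on common separators.
        for piece in re.split(r"[/._\-]+", parts.path.lower()):
            if piece:
                yield "img-url:path:" + piece


if __name__ == "__main__":
    sample = '<p><IMG SRC="http://10.1.2.3/ads/offer.gif"></p>'
    for token in img_url_tokens(sample):
        print(token)
```

For an IP-address URL like the sample, the host pieces come out as
separate numeric tokens, which is the kind of clue Graham describes.
Whether such tokens help or just add more correlated "it used HTML"
evidence would have to be tested against a real corpus.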
Garry Williams, Zvolve Systems, Inc., +1 770 813-4934