[Spambayes] how spambayes handles image-only spams

Mon Sep 8 23:37:28 EDT 2003

[Ryan Malayter, to Bill Yerazunis]
> ...
> My basic argument is that arbitrarily throwing out some HTML tokens in
> the parser, while leaving others, might make the filter more effective
> for only certain corpora. What test corpora were this decision based
> on?

Note that Bill doesn't work on spambayes; he heads CRM114:

    http://crm114.sourceforge.net/

IIUC, CRM114 doesn't throw away any HTML decorations; spambayes throws away
all HTML decorations (although it sucks out and specially tokenizes
everything that "looks like" a URL, regardless of whether it's hiding in
HTML or sitting in plain view).
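
As a rough illustration of "throw away the decorations but keep the URLs",
here is a minimal sketch. It is not the actual spambayes tokenizer: the
function name, the regular expressions, and the "proto:" / "url:" token
prefixes are all assumptions made for this example.

```python
import re

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r"(https?|ftp)://([^\s<>\"']+)", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]*>")

def tokenize(text):
    """Yield special tokens for URLs, then plain words with HTML tags dropped."""
    # Look for URLs first, so that ones hiding inside tags are still seen.
    for proto, rest in URL_RE.findall(text):
        yield "proto:" + proto.lower()
        for piece in re.split(r"[/.?&=:]+", rest):
            if piece:
                yield "url:" + piece.lower()
    # Blank out the URLs, then the tags, and split what is left on whitespace.
    plain = TAG_RE.sub(" ", URL_RE.sub(" ", text))
    for word in plain.split():
        yield word.lower()
```

The URL is tokenized the same way whether it sits inside an href attribute
or in plain text, while the tag itself contributes nothing.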

spambayes was developed against many people's test corpora, although, as I
said before, I don't think any of them had a significant quantity of HTML
ham (and I don't think Bill's did either).  Some did have *some* HTML ham,
though, and that's what drove the spambayes decision to throw away HTML
decorations (else -- and we're just going in circles here -- there was no
chance that the little bit of HTML ham wouldn't get misclassified as spam
every time; my saying that isn't a matter of argument, it's reporting what
actually happened).

> I think keeping some form of <img as tokens would help my
> detection of image-only spam, which seems to slip through SpamBayes
> more often than other types of spam.

I believe that, not really because it's image-only, but because image-only
messages often have very few features to judge.  It's hard for a brief
message to get a strong score in either direction in spambayes.  It's easier
under CRM114 because that generates many more features from a given message
than spambayes generates.
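
To make the feature-count difference concrete, here is a small sketch that
compares one-feature-per-word tokenizing with a sliding-window phrase
generator. The windowed generator is only in the spirit of what CRM114
does; it is not CRM114's actual algorithm, and the function names, the "*"
placeholder, and the window size of 5 are assumptions for illustration.

```python
from itertools import combinations

def unigram_features(words):
    """One feature per distinct word."""
    return set(words)

def windowed_phrase_features(words, window=5):
    """Pair each word with every subset of the up-to window-1 words before it.

    Skipped positions are written as "*", so word order and spacing are kept.
    """
    features = set()
    for i, word in enumerate(words):
        previous = words[max(0, i - window + 1):i]
        positions = range(len(previous))
        for r in range(len(previous) + 1):
            for picked in combinations(positions, r):
                phrase = [previous[j] if j in picked else "*" for j in positions]
                phrase.append(word)
                features.add(" ".join(phrase))
    return features
```

For a message of seven distinct words, such as "click here to see our
special offer", this yields 7 unigram features but 63 windowed ones, so a
brief message gives the classifier far more evidence to work with.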

> I also think it would be even better to have a multi-word token something
> like that produced by the CRM-114 token generator, which could find
> multi-tag strings like <img*src*http. These suggestions are just based on
> my knowledge of the algorithms involved and the contents of my corpora;
> I don't know enough Python to really give them a try in SpamBayes
> (although I'm working on that ;-).

In spambayes I'd be more inclined to write special code to identify the
img-src-http dance, and synthesize a token for that.  It's only one token,
though, and all tokens carry the same weight here -- it may still not be
enough to give "a typical" short message of this ilk a strong enough score
to nail it.  The only way to know is to try it.
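
A minimal sketch of such special-case code, assuming the raw HTML is
still available at the point where tokens are generated. The pattern, the
function name, and the token string "synthetic:img-src-http" are
hypothetical; none of them comes from the spambayes code base.

```python
import re

# Hypothetical pattern: an <img ...> tag whose src attribute starts with
# http:// or https://, with or without quotes around the value.
IMG_SRC_HTTP = re.compile(
    r"<img\b[^>]*?\bsrc\s*=\s*[\"']?\s*https?://",
    re.IGNORECASE,
)

def synthesize_img_tokens(html_text):
    """Yield a single synthetic token if the img-src-http pattern appears."""
    if IMG_SRC_HTTP.search(html_text):
        yield "synthetic:img-src-http"
```

The token is emitted at most once per message, no matter how many images
there are, which matches the "only one token" caveat: whether it moves
scores enough can only be settled by testing against real corpora.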