[Spambayes] date for new release to handle image spam?

Fri Feb 2 06:41:44 CET 2007

skip at pobox.com wrote on Thursday, February 01, 2007 11:27 AM -0600:

> As to the "creating more synthetic tokens", I'm open to suggestions.
> Ignoring its OCR features, I think SpamBayes currently identifies
> that an image is present, its mime type (distinguishing gif spams
> from Grandma's jpeg photos, for example) and the log of its size.  Maybe
> it could generate clues related to the image's dimensions, the total
> number of images in the email or number of distinct colors.  Do you
> have other suggestions?

Exactly which clues are significant is the $64,000 question, just as it
is with word frequencies.  The approach that SpamBayes took with that
problem may be applicable here.  Rather than try to imagine which clues
will be definitive, I was wondering out loud whether we might provide a
large number of seemingly unrelated clues and let the Bayesian classifier
look for correlations.  We can't guess in advance what those clues
should be, so the idea is to provide as many different ones as possible
and hope that SpamBayes finds some of them useful.  Maybe things like
animation rate, contrast ratio, color bias, ... any measurable property
that varies from one image to the next.  There are probably a lot of
metrics available to people who are expert in image processing.  Then
there are the email-specific ones like the content transfer encoding of
each MIME part, total characters in each MIME part, character set, etc.
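
To make the email-specific clues concrete, here is a rough sketch using
only Python's standard email package.  It is not SpamBayes' actual
tokenizer: the function name synthetic_tokens and the token prefixes
(part-type:, part-cte:, part-charset:, part-size-log2:, image-count:) are
made up for illustration.  Image-level metrics such as dimensions or
color counts would need an imaging library and are left out.

```python
import math
from email import message_from_string
from email.message import Message


def synthetic_tokens(msg: Message):
    """Yield hypothetical clue tokens for each leaf MIME part of msg."""
    image_count = 0
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        yield "part-type:" + part.get_content_type()
        cte = (part.get("Content-Transfer-Encoding") or "none").strip().lower()
        yield "part-cte:" + cte
        yield "part-charset:" + str(part.get_content_charset("none"))
        payload = part.get_payload(decode=True) or b""
        # Bucket the decoded size by log2 so similar sizes share one token.
        yield "part-size-log2:%d" % int(math.log2(len(payload) + 1))
        if part.get_content_maintype() == "image":
            image_count += 1
    # Emitted once per message, after all parts have been seen.
    yield "image-count:%d" % image_count
```

Calling list(synthetic_tokens(message_from_string(raw_text))) on a raw
message gives a flat list of strings that could be fed to the classifier
alongside the ordinary word tokens.  The classifier would then decide
which of them, if any, carry weight.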

--
Seth Goodman