[Spambayes] date for new release to handle image spam?
sethg at goodmanassociates.com
Sat Jan 6 07:58:29 CET 2007
David Abrahams wrote on Friday, January 05, 2007 9:22 AM -0600:
> "Seth Goodman" <sethg at goodmanassociates.com> writes:
> > Image spam is gradually moving in the direction of a captcha:
> > images that people can identify but computers can't. How far they
> > can go before it becomes so annoying that no one will look at it is
> > anyone's guess. As long as people can design effective captchas,
> > it will be possible to construct image spam that OCR will not
> > detect.
> Yes, I understand the principle. Of course, the effectiveness of
> captchas depends on the ineffectiveness of OCR. On the other hand,
> most OCR is built to deal with reasonably legible text, so we may need
> spam-specific OCR tools.
The human eye and brain are amazing image analyzers. OCR is only
ineffective when compared to them. While our visual sense can be
fooled, e.g. by optical illusions, its power is that it is robust to so
many forms of noise and image degradation. You don't need training to
find the text in a captcha. We are told it's there and we all just see
it. OCR programs use a variety of mathematical methods plus heuristics
and they require care and feeding to function at all. This is why
computers will remain behind humans in processing images for the
foreseeable future. Make OCR as "spam-specific" as you like, but it
will require tweaking each time spammers change to an unusual font,
background noise or text distortion. I don't want to seem morose about
this, but I don't believe it's a battle we can ultimately win. OCR can
still help Spambayes classify messages with image spam, but it's
not a silver bullet.
This is really a problem to be solved at the MTA with stricter
connection rules. Nonetheless, I suspect that Spambayes could improve
by creating more synthetic tokens that describe the image better and
taking advantage of serendipitous differences between tokens for image
spam and those in each user's ham. I'm not sure what those attributes
are, but that approach probably beats trying to keep up with a quickly evolving
captcha. Outlook doesn't help the situation, as it destroys much of the
MIME armor that might provide useful spam clues.
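
To make the synthetic-token idea concrete, here is a minimal sketch in Python using only the standard email package. It is not Spambayes's actual tokenizer. The function name image_tokens and the token strings ("image:subtype:", "image:size:2**", "image:filename:") are made up for illustration, and which attributes actually separate spam from ham is still the open question.

```python
from email.message import EmailMessage


def image_tokens(msg):
    """Yield hypothetical synthetic tokens describing each image part."""
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        # Image format as declared in the Content-Type header.
        yield "image:subtype:" + part.get_content_subtype()
        # Bucket the decoded size by power of two, so images of
        # similar size share a token even when padded with noise.
        payload = part.get_payload(decode=True) or b""
        bucket = max(len(payload).bit_length() - 1, 0)
        yield "image:size:2**%d" % bucket
        # Whether the sender gave the image a filename.
        yield "image:filename:" + ("yes" if part.get_filename() else "no")


# Build a small test message with one fake 1000-byte GIF attachment.
demo = EmailMessage()
demo["Subject"] = "demo"
demo.set_content("see attached")
demo.add_attachment(b"\x00" * 1000, maintype="image",
                    subtype="gif", filename="x.gif")

for token in image_tokens(demo):
    print(token)
```

Tokens like these would just be fed to the classifier alongside the ordinary word tokens, and the training data decides whether any of them carries weight for a given user.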