[Spambayes] Image spam (Was: is the database empty)

Amedee Van Gasse amedee at amedee.be
Sat Jun 10 21:08:55 CEST 2006


On Sat, June 10, 2006 15:16, Amedee Van Gasse said:
>
> On Sat, June 10, 2006 14:18, yahoo.de said:
>>
>> How could I train SB to recognize emails with advertisement images
>> for some product and so on? Let's say the email has no text, but only an
>> image in the body! (I know there is image scanner software for this
>> purpose, but what could be done in such cases?)
>
> Image spam is indeed a problem. Otoh, in my personal experience it's only
> a problem in theory. In practice there are enough other spammy
> characteristics in such emails.
>
> I don't know about image scanners specifically for spam detection, but I
> think it's possible to feed emails through such image scanners before
> they are fed to spambayes.
>
> I can imagine one could make an ocr program that converts images to text
> (if possible) and attaches the text to the email, which is subsequently
> fed to spambayes. That way, spambayes virtually "reads" the image just
> like a human does.
>
> Actually, you suggest something interesting. I'm going to try a few
> things and if they work, I'll post it on the list.

Hello again,

I found an interesting program that might be exactly what you are looking
for: ocrad. This is GNU software; it accepts pbm files or standard input
and writes text to standard output. So it is a command-line OCR program
that can be used in a script. Don't worry about the pbm files: the ocrad
manual describes how to convert other image formats to pbm (jpeg, png, ps,
pdf,...)

So what you could do in a prefiltering script (like a procmail script) is:
* extract the images from the email
* convert them to pbm
* send the pbm files to ocrad
* attach the resulting text to the original mail
* finally, let spambayes do its magic
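
The steps above could be sketched in Python roughly as follows. This is
only a sketch under assumptions, not a tested filter: it assumes the netpbm
converters (jpegtopnm, pngtopnm, giftopnm) and ocrad are installed and read
the image from standard input, and the function names (to_pnm, run_ocrad,
add_ocr_text) are made up for illustration. The result would then be handed
to spambayes as usual.

```python
import subprocess
from email import message_from_bytes
from email.mime.text import MIMEText

# Assumption: these netpbm tools are installed; each reads an image on
# standard input and writes PNM to standard output.
CONVERTERS = {
    "image/jpeg": ["jpegtopnm"],
    "image/png": ["pngtopnm"],
    "image/gif": ["giftopnm"],
}


def to_pnm(content_type, data):
    """Convert image bytes to PNM; return None for unsupported types."""
    command = CONVERTERS.get(content_type)
    if command is None:
        return None
    done = subprocess.run(command, input=data,
                          stdout=subprocess.PIPE, check=True)
    return done.stdout


def run_ocrad(pnm_data):
    """Feed PNM data to ocrad on standard input, return the text it prints."""
    # Assumption: "-" makes ocrad read the image from standard input.
    done = subprocess.run(["ocrad", "-"], input=pnm_data,
                          stdout=subprocess.PIPE, check=True)
    return done.stdout.decode("utf-8", errors="replace")


def add_ocr_text(raw_message, convert=to_pnm, ocr=run_ocrad):
    """Return the message bytes with the OCR text of its images attached."""
    msg = message_from_bytes(raw_message)
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() != "image":
            continue
        data = part.get_payload(decode=True)
        pnm = convert(part.get_content_type(), data) if data else None
        if pnm:
            text = ocr(pnm).strip()
            if text:
                texts.append(text)
    # Only a multipart message can take an extra part; others are left alone.
    if texts and msg.is_multipart():
        msg.attach(MIMEText("\n\n".join(texts), "plain", "utf-8"))
    return msg.as_bytes()
```

The converter and the OCR step are passed in as parameters so that either
can be swapped for another tool (or a stub while testing) without touching
the mail handling.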

However, I am a bit concerned about the performance cost of running OCR on
every single image you receive. I also don't agree with the thesis that image
spam (or banner spam) will not be recognised as spam by spambayes. I think
spambayes *will* find enough tokens to give the mail a score that is not
unsure.
For the rare occasions that image spam will result in an unsure score, I
suggest the following strategy:

1. score the email with spambayes (preliminary score)
2. everything with score 1 is 100% sure spam (high spam), so dump it to
/dev/null (my thesis is that image spam will be caught most of the time)
3. for every mail that is "low ham", unsure, or "low spam" AND has an
image, convert the image(s) to pbm, ocr with ocrad, and attach text to
email
4. rescore the email with spambayes (final score)
5. continue with your usual filtering rules

Note:
high ham = 100% sure ham, messages with a score of 0.00
low ham = probably ham, but with a score > 0.00 (you can use other
threshold values)
low spam = probably spam, but with a score < 1.00
high spam = 100% sure spam, messages with a score of 1.00
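
A minimal sketch of this two-pass strategy, under a few assumptions: the
scorer, the image check and the OCR step are passed in as plain functions
(score, has_image and add_ocr_text are hypothetical names, not spambayes
API), scores are rounded to two decimals to match the 0.00 / 1.00 notation
above, and the 0.20 / 0.90 cutoffs separating "low ham", unsure and
"low spam" are placeholder values to adjust.

```python
def band(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a score between 0 and 1 to one of the bands from the note."""
    s = round(score, 2)  # assumption: compare at two decimals, as in "0.00"
    if s == 0.0:
        return "high ham"
    if s == 1.0:
        return "high spam"
    if s < ham_cutoff:
        return "low ham"
    if s > spam_cutoff:
        return "low spam"
    return "unsure"


def two_pass(raw_message, score, has_image, add_ocr_text):
    """Return (action, score) following steps 1 to 4 of the strategy."""
    first = score(raw_message)                # step 1: preliminary score
    first_band = band(first)
    if first_band == "high spam":             # step 2: certain spam
        return "discard", first
    if first_band != "high ham" and has_image(raw_message):
        # steps 3 and 4: OCR the images, then score the enriched message
        return "rescored", score(add_ocr_text(raw_message))
    return "unchanged", first                 # nothing to gain from OCR
```

Only messages that are neither "high ham" nor "high spam" and that actually
contain an image pay the OCR cost, which is the point of scoring twice.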


The actual implementation of these ideas is left as an exercise to the
reader :)

-- 
Amedee Van Gasse


