[Spambayes] Spam in Images

Wed Aug 2 03:57:12 CEST 2006

I wonder if the RGB histogram of an image would provide any interesting 
opportunities for "tokens"?  In a real photographic image, that curve is 
generally smooth, and often flat or bell-shaped, with a relatively large 
number of RGB values whose counts exceed some threshold.  I would think 
that in a spam image the histogram would be much spikier, with only a 
few RGB values whose counts exceed some percentage of the total pixels 
in the picture.
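
To make the idea concrete, here is a rough sketch in Python. It is only an
illustration, not anything SpamBayes actually does: the function name, the
token strings, and the 1% and top-3 cutoffs are all made up, and it assumes
the pixels have already been decoded into (r, g, b) tuples (for example by
an imaging library such as PIL).

```python
from collections import Counter

def histogram_tokens(pixels, share=0.01, top_n=3):
    """Turn an image's color histogram into a few coarse tokens.

    pixels: iterable of (r, g, b) tuples (assumed already decoded).
    share:  hypothetical cutoff; a color is "dominant" if it covers at
            least this fraction of all pixels.
    top_n:  hypothetical number of most common colors to measure.
    """
    counts = Counter(pixels)
    total = sum(counts.values())
    if total == 0:
        return []

    # How many distinct colors each cover at least `share` of the image.
    dominant = sum(1 for n in counts.values() if n >= share * total)

    # How many pixels the top_n most common colors cover together.
    top = sum(n for _, n in counts.most_common(top_n))

    # Bucket the coverage into steps of 10% so similar images produce
    # the same token; cap the dominant count to keep the vocabulary small.
    coverage_bucket = (top * 10 // total) * 10
    return [
        "image-hist:dominant-colors:%d" % min(dominant, 20),
        "image-hist:top%d-coverage:%d" % (top_n, coverage_bucket),
    ]

# A flat, text-like image: two colors account for every pixel.
flat = [(255, 255, 255)] * 95 + [(0, 0, 0)] * 5
# A photo-like image: 1000 distinct colors, one pixel each.
smooth = [(i % 256, (i // 4) % 256, (i * 7) % 256) for i in range(1000)]

flat_tokens = histogram_tokens(flat)
smooth_tokens = histogram_tokens(smooth)
```

Whether tokens like these actually separate spam images from real photos
is something only testing on real mail would show; the cutoffs above are
guesses.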

Tim Peters wrote:

>[Alan Arndt]
>  
>
>>Over the past month or more I have noticed a large increase in the amount of
>>spam I receive with the Spam text translated into images.  The actual text
>>of the message is benign gibberish designed to pass Bayesian filters.  They
>>have even taken the step of inserting random bits into the image so that no
>>two images have the same signature.  I've received multiple messages
>>with the same fundamental image.
>>    
>>
>
>Yup, and they're learning to avoid other stupid mistakes too; e.g.,
>the gibberish /changes/ from one message to the next, and so does the
>forged sender address.  While randomization isn't new in spam, most
>spammers have traditionally done a poor job on it.  For example, for a
>long time it was very effective to train on the gibberish, since
>multiple spammers appeared to use randomization software that produced
>the /same/ gibberish time after time.  Likewise they tended to forge
>the same sender addresses repeatedly.  Most spam still does, for that
>matter.  But some spammers have gotten much smarter.
>
>  
>
>>I haven't thought of a decent way to filter these types of things.
>>    
>>
>
>Me neither.  They're never false negatives for me, but I reliably get
>a few unsures every day from what appears to be the same pump-and-dump
>scam-spam source (these are messages hard-selling specific penny
>stocks -- the scammer hopes to drive up the market price ("pump") by
>stimulating demand, and then sell quick at a profit ("dump")).
>
>It's very much in the spirit of SpamBayes to generate tokens for what
>the user /sees/, but in these cases we have no idea what the user sees
>(except for the gibberish text).
>
>BTW, it's typical of pump-and-dump scams that they're not trying to
>extract money /directly/ from you (they're trying to get you to buy a
>stock on the open market), so we don't even get a URL or mailing
>address to tokenize.
>
>  
>
>> I hope someone else can and that it can get implemented into SpamBayes.
>>    
>>
>
>It's discussed here (maybe more so on spambayes-dev, the related
>developers' mailing list) regularly, but AFAICT extracting readable
>text from images is a complicated and expensive job.  If someone finds
>a programmatic way to do it cheaply and with reasonable accuracy, I'm
>sure SB could make excellent use of it.