[Spambayes] Spam in Images
Tim Stone
tim at aterraform.com
Wed Aug 2 03:57:12 CEST 2006
I wonder if the RGB histogram of an image would provide any interesting
opportunities for "tokens"? In a real photographic image, that curve is
generally smooth, and oftentimes flat or a bell curve, with a
relatively large number of RGB value counts above some threshold. I
would think that in a spam image the histogram would be much more spiky,
with only a few RGB value counts above some percentage of the
total pixels in the picture.
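
A rough sketch of the idea, not anything SpamBayes does today. It
assumes the image has already been decoded into a list of (r, g, b)
tuples; the function name, the 1% cutoff, and the token strings are all
made up for illustration.

```python
from collections import Counter


def histogram_tokens(pixels, share=0.01):
    """Return hypothetical tokens describing how spiky an RGB histogram is.

    pixels: iterable of (r, g, b) tuples (decoding the image is assumed
            to have happened elsewhere).
    share:  a colour counts as a "spike" if it covers more than this
            fraction of all pixels (0.01 is an arbitrary guess).
    """
    counts = Counter(pixels)
    total = sum(counts.values())
    if total == 0:
        return []

    # Pixel counts of the colours that individually exceed the cutoff.
    spikes = [n for n in counts.values() if n / total > share]

    # Fraction of the whole picture covered by those few colours.
    covered = sum(spikes) / total

    # Bucket the spike count by powers of two so that images differing
    # only by a little random noise still tend to give the same token.
    if spikes:
        spike_token = "image:rgb-spikes:2**%d" % (len(spikes).bit_length() - 1)
    else:
        spike_token = "image:rgb-spikes:0"

    # Coverage in tenths: 0 means no dominant colours, 10 means all of it.
    return [spike_token, "image:rgb-spike-coverage:%d" % int(covered * 10)]
```

A two-colour "text on a plain background" image puts every pixel in a
spike, while an image where no colour reaches the cutoff yields the
"rgb-spikes:0" token. Whether such tokens actually separate spam from
ham would have to be tested on real mail.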
Tim Peters wrote:
>[Alan Arndt]
>
>
>>Over the past month or more I have noticed a large increase in the amount of
>>spam I receive with the Spam text translated into images. The actual text
>>of the message is benign gibberish designed to pass Bayesian filters. They
>>have even taken the step of inserting random bits into the image so that no
>>two images have the same signature. I've received many multiple messages
>>with the same fundamental image.
>>
>>
>
>Yup, and they're learning to avoid other stupid mistakes too; e.g.,
>the gibberish /changes/ from one message to the next, and so does the
>forged sender address. While randomization isn't new in spam, most
>spammers have traditionally done a poor job on it. For example, for a
>long time it was very effective to train on the gibberish, since
>multiple spammers appeared to use randomization software that produced
>the /same/ gibberish time after time. Likewise they tended to forge
>the same sender addresses repeatedly. Most spam still does, for that
>matter. But some spammers have gotten much smarter.
>
>
>
>>I haven't thought of a decent way to filter these types of things.
>>
>>
>
>Me neither. They're never false negatives for me, but I reliably get
>a few unsures every day from what appears to be the same pump-and-dump
>scam-spam source (these are messages hard-selling specific penny
>stocks -- the scammer hopes to drive up the market price ("pump") by
>stimulating demand, and then sell quick at a profit ("dump")).
>
>It's very much in the spirit of SpamBayes to generate tokens for what
>the user /sees/, but in these cases we have no idea what the user sees
>(except for the gibberish text).
>
>BTW, it's typical of pump-and-dump scams that they're not trying to
>extract money /directly/ from you (they're trying to get you to buy a
>stock on the open market), so we don't even get a URL or mailing
>address to tokenize.
>
>
>
>> I hope someone else can and that it can get implemented into SpamBayes.
>>
>>
>
>It's discussed here (maybe more so on spambayes-dev, the related
>developers' mailing list) regularly, but AFAICT extracting readable
>text from images is a complicated and expensive job. If someone finds
>a programmatic way to do it cheaply and with reasonable accuracy, I'm
>sure SB could make excellent use of it.
>_______________________________________________
>SpamBayes at python.org
>http://mail.python.org/mailman/listinfo/spambayes
>Check the FAQ before asking: http://spambayes.sf.net/faq.html
>
>
>
>