Over the past month or more I have noticed a large increase in the amount of spam I receive with the Spam text translated into images. The actual text of the message is benign gibberish designed to pass Bayesian filters. They have even taken the step of inserting random bits into the image so that no two images have the same signature. I've received many multiple messages with the same fundamental image. I haven't thought of a decent way to filter these types of things. I hope someone else can and that it can get implemented into SpamBayes. Until recently I was very pleased that only a few messages a day (of 20-50) made it through.

Does anyone have any good suggestions? I can certainly send plenty of samples if someone needs them.

Thanks,
Alan Arndt
alan five eighty two at do it now.com
Alan> I haven't thought of a decent way to filter these types of things.
Alan> I hope someone else can and that it can get implemented into
Alan> SpamBayes....
Alan> Does anyone have any good suggestions?

This topic has come up several times in the past. There is, as yet, no perfect way to identify these sorts of spams. The last time it came up (maybe a month ago), optical character recognition (OCR) came up as a possible means of getting at the text. Unfortunately, the open source tools available fall far short of the mark as far as accuracy is concerned.

Perhaps image size would be a helpful clue. I don't know if anyone has tried that before.

Skip
Skip,

Thanks. I did look through the last three months of forum archive. I didn't specifically see this addressed. I did notice some comments about porn images, etc.

I don't think the image size works. I just saved about 20 of my most recent spam images and while the vast majority (1/2) are pushing a stock and most of them are pretty similar in size they aren't all the same. The other 10 images were all of quite varying sizes, even 3 with exactly the same text were deliberately made quite different in size. I was amazed that they had gone to the extent of adding random bits into each of the images, but I guess they knew someone would try to compare them.

I am not an expert at looking at the raw data of the e-mail. I can only hope that there is some way they reference them that might be different from images sent to me by friends. But I'm not optimistic that one can determine that by the attributes of the image or the rest of the message itself.

So that does lead to examining the image. The main difference is immediately obvious. It's not a picture, it's just some formatted text put into an image. Given that one could hope that some simple image analysis process could quickly classify them as different, or even make it a learning process like the rest of the spam filtering.

The big downside is that image analysis is expensive compute and time wise. Not to mention all the various formats of images that the tool would need to process. Perhaps that is constrained by limitations of what e-mail programs will actually render.

All in all, not a good outlook it seems.

-Alan

-----Original Message-----
From: skip@pobox.com [mailto:skip@pobox.com]
Sent: Tuesday, August 01, 2006 4:29 PM
To: Alan Arndt
Cc: spambayes@python.org
Subject: Re: [Spambayes] Spam in Images

Alan> I haven't thought of a decent way to filter these types of things.
Alan> I hope someone else can and that it can get implemented into
Alan> SpamBayes....
Alan> Does anyone have any good suggestions?
Alan> I don't think the image size works. I just saved about 20 of my
Alan> most recent spam images and while the vast majority (1/2) are
Alan> pushing a stock and most of them are pretty similar in size they
Alan> aren't all the same.

We use a trick for sizes that tends to work pretty well. Instead of noting the precise size, we note the log of the size in base 2 and then throw away the fraction. I just implemented that for image sizes and got these results using my current training database:

    token,nspam,nham,spam prob
    image-size:2**5,1,0,0.844827586207
    image-size:2**6,4,1,0.5
    image-size:2**7,4,1,0.5
    image-size:2**8,6,0,0.96511627907
    image-size:2**9,3,0,0.934782608696
    image-size:2**10,7,1,0.620791675168
    image-size:2**11,9,0,0.97619047619
    image-size:2**12,13,0,0.983271375465
    image-size:2**13,14,0,0.984429065744
    image-size:2**14,53,0,0.995790458372
    image-size:2**15,19,1,0.813543282782

That doesn't necessarily mean much without some testing. I don't tend to get a lot of ham with images. I'll create a patch and add it to the SpamBayes website so others can try it out.

Skip
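For readers who want to see the bucketing trick concretely, here is a minimal sketch of "log base 2, fraction thrown away". It is not the actual SpamBayes patch: the function name is made up for illustration, and it assumes the size being measured is the image's length in bytes, which the message does not state.

```python
def image_size_token(nbytes):
    """Return a coarse size token such as 'image-size:2**14'.

    nbytes is assumed to be the image's size in bytes (an assumption; the
    message does not say which measure of size is used).
    """
    if nbytes < 1:
        raise ValueError("size must be a positive integer")
    # For a positive int n, n.bit_length() - 1 equals floor(log2(n)) and is
    # computed exactly, which avoids float rounding near powers of two.
    return "image-size:2**%d" % (nbytes.bit_length() - 1)


if __name__ == "__main__":
    for size in (40, 16384, 20000, 32767, 32768):
        print(size, image_size_token(size))
```

The point of the coarse bucket is that images whose sizes differ by a few random bytes still land on the same token, so the classifier can accumulate counts for it.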
[Alan Arndt]
Over the past month or more I have noticed a large increase in the amount of spam I receive with the Spam text translated into images. The actual text of the message is benign gibberish designed to pass Bayesian filters. They have even taken the step of inserting random bits into the image so that no two images have the same signature. I've received many multiple messages with the same fundamental image.
Yup, and they're learning to avoid other stupid mistakes too; e.g., the gibberish /changes/ from one message to the next, and so does the forged sender address. While randomization isn't new in spam, most spammers have traditionally done a poor job on it. For example, for a long time it was very effective to train on the gibberish, since multiple spammers appeared to use randomization software that produced the /same/ gibberish time after time. Likewise they tended to forge the same sender addresses repeatedly. Most spam still does, for that matter. But some spammers have gotten much smarter.
I haven't thought of a decent way to filter these types of things.
Me neither. They're never false negatives for me, but I reliably get a few unsures every day from what appears to be the same pump-and-dump scam-spam source (these are messages hard-selling specific penny stocks -- the scammer hopes to drive up the market price ("pump") by stimulating demand, and then sell quick at a profit ("dump")).

It's very much in the spirit of SpamBayes to generate tokens for what the user /sees/, but in these cases we have no idea what the user sees (except for the gibberish text).

BTW, it's typical of pump-and-dump scams that they're not trying to extract money /directly/ from you (they're trying to get you to buy a stock on the open market), so we don't even get a URL or mailing address to tokenize.
I hope someone else can and that it can get implemented into SpamBayes.
It's discussed here (maybe more so on spambayes-dev, the related developers' mailing list) regularly, but AFAICT extracting readable text from images is a complicated and expensive job. If someone finds a programmatic way to do it cheaply and with reasonable accuracy, I'm sure SB could make excellent use of it.
I wonder if the RGB histogram of an image would provide any interesting opportunities for "tokens"? In a real photographic image, that curve is generally smooth, and oftentimes flat or a bell curve, with some relatively large number of RGB value counts above some threshold. I would think that in a spam image the histogram would be much more spiky, with only a few RGB value counts above some percentage of the total pixels in the picture.

_______________________________________________
SpamBayes@python.org
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
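The RGB-histogram idea raised in this part of the thread can be sketched roughly as follows. This is only an illustration under stated assumptions: pixels are given as a plain list of (r, g, b) tuples however they were decoded, and the token name and the choice of the two most common colours are hypothetical, not anything SpamBayes emits.

```python
from collections import Counter


def top_color_share_token(pixels, k=2):
    """Bucket the share of pixels covered by the k most common colours.

    pixels is a sequence of (r, g, b) tuples. The token name and the
    choice of k are illustrative assumptions, not SpamBayes tokens.
    """
    total = len(pixels)
    if total == 0:
        return "image-top-colors:none"
    covered = sum(count for _, count in Counter(pixels).most_common(k))
    # Integer arithmetic keeps the bucket (0..10, in tenths) exact.
    return "image-top-colors:%d" % (covered * 10 // total)


# Synthetic "text on a plain background": mostly white, some black, and a
# few odd pixels standing in for injected noise.
text_like = [(255, 255, 255)] * 900 + [(0, 0, 0)] * 90 + [(i, 0, 0) for i in range(1, 11)]
# Synthetic "photo": every pixel a different colour.
photo_like = [(i % 256, i // 256, 0) for i in range(1000)]

if __name__ == "__main__":
    print(top_color_share_token(text_like))
    print(top_color_share_token(photo_like))
```

A spiky histogram shows up as a high bucket and a smooth one as a low bucket. Whether real spam images and real photographs separate this cleanly would need testing on actual mail.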
In article <1f7befae0608011809rf7af374qf357c02b1d8d3c83@mail.gmail.com>, Tim Peters <tim.peters@gmail.com> Tue, 1 Aug 2006 21:09:50 writes
but AFAICT extracting readable text from images is a complicated and expensive job. If someone finds a programmatic way to do it cheaply and with reasonable accuracy, I'm sure SB could make excellent use of it.
The samples I have been getting this last month have the image chopped up into many separate jig-saw pieces. I am told that Outlook will kindly stitch it all together to show a single image but my safe email client just shows the separate files. I presume this would complicate analysis even further. What I find strange is that this type of spam represents 99% of the spam I receive (20 or so a day). It seems to be the only spam that manages to regularly get through the Brightmail filter that my ISP uses. If they can't nail the problem then I doubt if there is much to be done here. -- Les Desser (The Reply-to address IS valid)
Hi guys,

I have had the same experience as Alan: I am receiving more and more image spam without hyperlinks these days. I notice that the HTML for the image often has the same format, like

<IMG ALT="" border="0" SRC="cid:BB7EED6ABA@ascaniopainting.com">

What does the cid here mean? Is it valuable for recognizing this spam?

--- mailto:abryson@bytefocus.com homepage:http://www.Wang-Labs.com

2006/8/2, Alan Arndt <aga@jlw.com>:
> Over the past month or more I have noticed a large increase in the amount of spam I receive with the Spam text translated into images. The actual text of the message is benign gibberish designed to pass Bayesian filters.

-- Have a Good Day
Alice> .... i notice that many html code of image has the same format
Alice> like <IMG ALT="" border="0"
Alice> SRC="cid:BB7EED6ABA@ascaniopainting.com"> what the cid here mean?
Alice> does it valueable for recognize this spam?

It just identifies an image that is delivered along with the message. By itself it doesn't mean a lot.

Skip
"skip" == skip <skip@pobox.com> writes:
Alice> .... i notice that many html code of image has the same format
Alice> like <IMG ALT="" border="0"
Alice> SRC="cid:BB7EED6ABA@ascaniopainting.com"> what the cid here mean?
Alice> does it valueable for recognize this spam?

skip> It just identifies an image that is delivered along with the
skip> message. By itself it doesn't mean a lot.

I should have given a bit more complete answer based on your message's more general point. I recently added a fair amount of code to SpamBayes to "crack" the content of images. The new code works very well for me. If you'd like to try it, here's what you'll need to do:

1. Check out the latest source from the CVS repository. (There's been no new release since my recent checkins.) Install it.

2. Install the Python Imaging Library:

   http://www.pythonware.com/products/pil/

3a. (Windows) Grab the ocrad-cygwin package from the SpamBayes Files page:

   http://sourceforge.net/project/showfiles.php?group_id=61702

   Unpack the zip file and copy ocrad.exe somewhere on your PATH.

3b. (Unix/Linux/Mac) Grab the ocrad source distribution from its web site:

   http://www.gnu.org/software/ocrad/ocrad.html

   Unpack and install it.

I realize this may not be all that straightforward for people who are unused to installing open source software. Once you've done it a couple times though, it gets easier. Hopefully, we can get another SpamBayes alpha release out in the next little while. (Tony, if there's anything I can do to help make this happen, let me know.)

Once you're ready to go, add the following to your SpamBayes options:

    x-lookup_ip: True
    lookup_ip_cache: ~/.dnscache

    x-image_size: True

    x-crack_images: True
    crack_image_cache: ~/.image_cache.pickle

The first group is unrelated to the image spam, but I find it helps me a lot. It maps hostnames to their IP addresses using DNS and generates tokens based on those addresses. The second records tokens about the size of images. The third enables text extraction from images (OCR, or optical character recognition). This is where PIL and Ocrad come in.

I still get the occasional false negative on image spam, but it's definitely manageable and should improve as Ocrad (itself still a very alpha piece of software) improves. Even though Ocrad does a poor job of text extraction from a human comprehension standpoint, it generates tokens that SpamBayes just loves and seems to generate enough unique tokens to tip the scales on most image spam.

Skip
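To give a feel for how OCR output can become classifier tokens, here is a rough sketch. It is not the SpamBayes implementation. It assumes an `ocrad` executable on PATH that prints recognized text to standard output when given a PNM image file; the helper names and the `image-text:` token prefix are invented for illustration.

```python
import re
import subprocess


def ocr_text(pnm_path, ocrad="ocrad"):
    """Run Ocrad on a PNM image file and return whatever text it prints.

    Assumes the image has already been converted to PNM (for example with
    an imaging library) and that `ocrad` is installed.
    """
    result = subprocess.run([ocrad, pnm_path], capture_output=True, check=False)
    # latin-1 maps every byte to a character, so decoding never fails even
    # if the OCR output contains unexpected bytes.
    return result.stdout.decode("latin-1")


def ocr_tokens(text, prefix="image-text:"):
    """Lower-case OCR output and emit one prefixed token per distinct word.

    Misread words (e.g. 'str0ng') are kept on purpose: a statistical
    filter can learn from them even though a human would call them noise.
    """
    words = re.findall(r"[a-z0-9$%]{3,}", text.lower())
    return sorted({prefix + word for word in words})


if __name__ == "__main__":
    # No OCR binary is needed to try the tokenizer on sample text.
    print(ocr_tokens("STR0NG BUY!! strong buy, 500% gains"))
```

This reflects the observation in the message that poor OCR can still be useful: the garbled words are consistent enough from one spam to the next to act as distinctive tokens.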
Skip,

That sounds great. Thanks. I don't know if I will take all the steps to try and get it up and running or wait for a new release, but we really appreciate it.

I did have a few questions. How much time/processing does the OCR take? I would think that might be very intensive. Not that most people don't have the cycles to spare, or that it wouldn't be much faster than scanning the spam myself, but I'm just curious.

Also, should one re-initialize the spam database? Are all tokens the same, so that once extracted these are just like any other? Or are they somehow grouped to relate to images?

Thanks,
Alan

-----Original Message-----
From: skip@pobox.com [mailto:skip@pobox.com]
Sent: Friday, August 18, 2006 6:33 AM
To: Alice Bryson <abryson@bytefocus.com>; spambayes@python.org; Alan Arndt
Subject: Analyzing text in image spam (was: Spam in Images)
On Friday 18 August 2006 06:32, skip@pobox.com wrote:

Alice> .... i notice that many html code of image has the same format
Alice> like <IMG ALT="" border="0"
Alice> SRC="cid:BB7EED6ABA@ascaniopainting.com"> what the cid here mean?
Alice> does it valueable for recognize this spam?

skip> It just identifies an image that is delivered along with the
skip> message. By itself it doesn't mean a lot.

Just had an idea that you may want to think about.

If a message has an HTML part with a CID that references a proper domain name/IP address, why not add an option to tag that as high spam probability unless it's on a whitelist?

What I'm thinking is that most clients that can handle HTML mail now include the option to not load images from the web. Personally, I prefer and pretty much strictly use plain text, and the only HTML-format mail I even consider legit is whitelisted. I suspect many of us have the same belief.
On Friday, August 18, 2006 12:19 PM -0500, Fast Turtle wrote:
If a message has an html part with a CID that references a proper domain name/ip address, why not add an option to tag that as high spam probability unless it's on a whitelist.
What I'm thinking is that most clients that can handle html mail now include the option to not load images from the web. Personally, I prefer and pretty much strictly use plain text and the only html format mail I even consider legit is white listed and I suspect many of us have the same belief
I don't feel the same way at all. I get various commercial newsletters in HTML format, and I have no problem with it. Whitelisting is something I don't care to do and the HTML email usually has a text MIME part that satisfies SpamBayes' curiosity. Some of my clients use HTML email, and this is one battle I don't care to take to that front. In addition, some of my friends keep sending me HTML mail even though I've asked them for plain text only. My conclusion is that I'm stuck with HTML mail for the foreseeable future. This is not a perfect world.

-- Seth Goodman
On Fri, 2006-08-18 at 08:32 -0500, skip@pobox.com wrote: <snip>
Once you're ready to go, add the following to your SpamBayes options:
x-lookup_ip: True
lookup_ip_cache: ~/.dnscache

Is anyone using this option? It seems to me that this option alone does nothing: you have to enable both x-lookup_ip and x-pick_apart_urls. Is that right, or am I missing something?

Once both are enabled it seems to work, but the mail processing is very, very slow.

--
Luigi Pugnetti
Symbolic S.p.A.
V.le Mentana, 29
I-43100 Parma Italy
Tel: +39 0521 708811
Fax: +39 0521 776190
>> Once you're ready to go, add the following to your SpamBayes options: >> >> x-lookup_ip: True >> lookup_ip_cache: ~/.dnscache >> Luigi> Is someone using this option? To me seems that this option alone Luigi> do nothing. You have to enable both x-lookup_ip and Luigi> x-pick_apart_urls. Is it right or am I missing something? Perhaps. I can't recall. Do you have PyDNS installed? Luigi> Once both are enabled it seems to work but the mail processing is Luigi> very very slow. First time through, yes. After that, it should (in theory) rely on its cache of IP address information. I may have some pending checkins for that though (*). Note also that a fairly small training database works for me (fewer than 100 hams, 250-300 spams). If you have a massive training database, then, yes, this will slow things down dramatically. The IP lookup and image OCR stuff changes the properties of your database enough that I think it's worth retraining from scratch. Skip (*) Alas, I didn't get around to checking stuff in last night. Maybe over the weekend. S
On Fri, 2006-11-03 at 09:56 -0600, skip@pobox.com wrote:
>> Once you're ready to go, add the following to your SpamBayes options: >> >> x-lookup_ip: True >> lookup_ip_cache: ~/.dnscache >>
Luigi> Is someone using this option? To me seems that this option alone Luigi> do nothing. You have to enable both x-lookup_ip and Luigi> x-pick_apart_urls. Is it right or am I missing something?
Perhaps. I can't recall. Do you have PyDNS installed?
Yes, I have PyDNS installed. I used tcpdump to monitor DNS requests and there are no requests if x-pick_apart_urls is disabled. Looking at the code, it seems that the check for x-lookup_ip is inside an if (pick_url enabled) construct.
Luigi> Once both are enabled it seems to work but the mail processing is Luigi> very very slow.
First time through, yes. After that, it should (in theory) rely on its cache of IP address information. I may have some pending checkins for that though (*). Note also that a fairly small training database works for me (fewer than 100 hams, 250-300 spams). If you have a massive training database, then, yes, this will slow things down dramatically. The IP lookup and image OCR stuff changes the properties of your database enough that I think it's worth retraining from scratch.
I tried it on a sample of 5000 emails but stopped it because it had not finished after more than half an hour. From tcpdump I could see a request every 1,2 seconds (or something like that); even considering that not every mail contains a URL, it was very slow. As a note, I tried it on Windows XP with OCR scanning enabled, but OCR alone was much faster.
Skip
(*) Alas, I didn't get around to checking stuff in last night. Maybe over the weekend.
S
-- Luigi Pugnetti Symbolic S.p.A. V.le Mentana, 29 I-43100 Parma Italy Tel: +39 0521 708811 Fax: +39 0521 776190
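The nesting Luigi describes can be illustrated with a small sketch. This is a simplified stand-in, not the SpamBayes tokenizer: the function, the token names, and the plain-dict options are all hypothetical, and only the two option names come from the thread.

```python
from urllib.parse import urlparse


def url_tokens(urls, options, resolve):
    """Return URL-derived tokens; IP tokens need both options switched on."""
    tokens = []
    if options.get("x-pick_apart_urls"):
        for url in urls:
            host = urlparse(url).hostname
            if host is None:
                continue
            tokens.append("url-host:" + host)
            # The IP lookup sits inside the pick-apart branch, so
            # x-lookup_ip has no effect unless x-pick_apart_urls is set.
            if options.get("x-lookup_ip"):
                ip = resolve(host)
                if ip:
                    tokens.append("url-ip:" + ip)
    return tokens


if __name__ == "__main__":
    fake_resolve = lambda host: "192.0.2.1"  # stand-in for a DNS lookup
    urls = ["http://example.com/buy"]
    print(url_tokens(urls, {"x-lookup_ip": True}, fake_resolve))
    print(url_tokens(urls, {"x-lookup_ip": True, "x-pick_apart_urls": True}, fake_resolve))
```

With only x-lookup_ip enabled the resolver is never called, which matches the absence of DNS traffic seen in tcpdump.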
Luigi> Once both are enabled it seems to work but the mail processing is
Luigi> very very slow.

>> First time through, yes. After that, it should (in theory) rely on
>> its cache of IP address information. I may have some pending
>> checkins for that though (*). Note also that a fairly small training
>> database works for me (fewer than 100 hams, 250-300 spams). If you
>> have a massive training database, then, yes, this will slow things
>> down dramatically. The IP lookup and image OCR stuff changes the
>> properties of your database enough that I think it's worth retraining
>> from scratch.

Luigi> I have tried on a sample of 5000 emails but I stopped it because
Luigi> after more than half an hour it didn't finish. From tcpdump I
Luigi> could see a request every 1,2 seconds (or something like that)
Luigi> now even considering that not every mail contains an url it was
Luigi> very slow. As a note I tried it on windows XP with ocr scanning
Luigi> enabled but ocr alone was much faster.

I can't imagine a scenario where I would need 5000 emails to get decent results with SpamBayes. If that was the common case, everyone would give up on it long before it was of any use. I still suggest you try starting from scratch.

Skip
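The "slow the first time, then served from cache" behaviour discussed here can be sketched as below. This is an illustration only: the class name is invented, the standard library's socket.gethostbyname stands in for PyDNS, and the pickle file plays the role of the lookup_ip_cache option without claiming to match its real format.

```python
import os
import pickle
import socket
import tempfile


class CachedResolver:
    """Resolve hostnames once and remember the answers across runs."""

    def __init__(self, cache_path, lookup=socket.gethostbyname):
        self.cache_path = os.path.expanduser(cache_path)
        self.lookup = lookup
        self.cache = {}
        if os.path.exists(self.cache_path):
            # Only load cache files you wrote yourself; unpickling
            # untrusted data is unsafe.
            with open(self.cache_path, "rb") as fh:
                self.cache = pickle.load(fh)

    def resolve(self, hostname):
        if hostname not in self.cache:
            try:
                self.cache[hostname] = self.lookup(hostname)
            except OSError:
                # Remember failures too, so dead names are not retried.
                self.cache[hostname] = None
        return self.cache[hostname]

    def save(self):
        with open(self.cache_path, "wb") as fh:
            pickle.dump(self.cache, fh)


if __name__ == "__main__":
    def fake_lookup(hostname):
        print("looking up", hostname)
        return "192.0.2.1"

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "dnscache.pickle")
        first = CachedResolver(path, lookup=fake_lookup)
        first.resolve("example.com")   # triggers the lookup
        first.save()
        second = CachedResolver(path, lookup=fake_lookup)
        second.resolve("example.com")  # served from the saved cache
```

A first pass over a large corpus still pays for one lookup per distinct hostname, which is consistent with the slow initial run reported above.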
>> Once you're ready to go, add the following to your SpamBayes options: >> >> x-lookup_ip: True >> lookup_ip_cache: ~/.dnscache >> Luigi> Is someone using this option? To me seems that this option alone Luigi> do nothing. You have to enable both x-lookup_ip and Luigi> x-pick_apart_urls. Is it right or am I missing something? Yes, you are right. x-pick_apart_urls needs to be enabled to use x-lookup_ip. My apologies for not catching that before. Skip
<IMG ALT="" border="0" SRC="cid:BB7EED6ABA@ascaniopainting.com"> what the cid here mean? does it valueable for recognize this spam?
It's a content-id to point to the resource (image source, css content, script) within a multipart message with an HTML part. These are usually random, so it doesn't help. Regards, Shawn K. Hall http://12PointDesign.com/
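To make the cid mechanism concrete, here is a small sketch using Python's standard email package. It builds a sample message whose HTML part refers to an embedded image by Content-ID, then maps each cid: reference back to the MIME part it names. The sample address and the helper names are made up for illustration.

```python
import re
from email.message import EmailMessage


def build_sample():
    """Build a message whose HTML part embeds an image via a cid: URL."""
    msg = EmailMessage()
    msg["Subject"] = "sample"
    msg.set_content("plain text fallback")
    msg.add_alternative(
        '<html><body><img alt="" border="0" '
        'src="cid:part1@example.invalid"></body></html>',
        subtype="html",
    )
    # Attach the image as a related part of the HTML alternative. The
    # Content-ID header carries angle brackets; the cid: URL does not.
    html_part = msg.get_payload()[1]
    html_part.add_related(b"GIF89a", "image", "gif", cid="<part1@example.invalid>")
    return msg


def cid_targets(msg):
    """Map each cid: reference found in HTML parts to the part it names."""
    by_cid = {}
    html = []
    for part in msg.walk():
        cid = part.get("Content-ID")
        if cid:
            by_cid[cid.strip("<>")] = part
        if part.get_content_type() == "text/html":
            html.append(part.get_content())
    refs = re.findall(r'cid:([^"\'>\s]+)', "".join(html), flags=re.I)
    return {ref: by_cid.get(ref) for ref in refs}


if __name__ == "__main__":
    for ref, part in cid_targets(build_sample()).items():
        print(ref, "->", part.get_content_type() if part is not None else None)
```

Because the identifier is just an internal label that the sender picks, and is usually random, it says little about the message on its own, as noted above.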
participants (10)

- Alan Arndt
- Alice Bryson <abryson@bytefocus.com>
- Fast Turtle
- Les Desser
- Luigi Pugnetti
- Seth Goodman
- Shawn K. Hall
- skip@pobox.com
- Tim Peters
- Tim Stone