[Spambayes] Analyzing text in image spam

Tim Stone tim at aterraform.com
Mon Aug 21 05:06:36 CEST 2006

One thing to keep in mind when compiling with PIL using distutils: PIL 
loads its image plugins dynamically, so the build does not pick them up 
and the compiled executable fails at runtime.  You have to explicitly 
import the plugin for each image type you support, like:

# explicit imports so the build bundles these plugins
import BmpImagePlugin
import JpegImagePlugin

Peter Barker wrote:

>I have installed the CVS version as suggested. A couple of points which may 
>help others trying it (especially the PIL). I am using FC5 on AMD64, and had 
>to install tk-devel, tcl-devel as well as tk and tcl (and tkinter etc). To 
>get PIL to successfully include support for everything I had to 
>add /usr/lib64 to the standard paths in setup.py. The freetype2 files 
>required by PIL are in freetype-devel.
>I will report how it performs in a few days. Is there any way I can easily 
>test it with my current spam collection without creating a new .hammiedb and 
>starting again? My email is stored in one file/folder (mbox). I tried just 
>feeding a few messages which had been incorrectly classified, and they were 
>now classified as spam, but I think that is because I had trained them as 
>spam after I received them (with version 1.1a2). I am using kmail with 
>sb_bnfilter.py. Can I tell from the X-Spambayes-Evidence header if the new 
>code is detecting any spam?
>Peter Barker
>>>>>>>"skip" == skip  <skip at pobox.com> writes:
>>I should have given a bit more complete answer based on your message's more
>>general point.  I recently added a fair amount of code to SpamBayes to
>>"crack" the content of images.  The new code works very well for me.  If
>>you'd like to try it, here's what you'll need to do:
>>    1. Check out the latest source from the CVS repository.  (There's been
>>       no new release since my recent checkins.)  Install it.
>>    2. Install the Python Imaging Library:
>>           http://www.pythonware.com/products/pil/
>>    3a. (Windows) Grab the ocrad-cygwin package from the
>>       SpamBayes Files page:
>>           http://sourceforge.net/project/showfiles.php?group_id=61702
>>       Unpack the zip file and copy ocrad.exe somewhere on your PATH.
>>    3b. (Unix/Linux/Mac) Grab the ocrad source distribution from its web
>>        site:
>>            http://www.gnu.org/software/ocrad/ocrad.html
>>        Unpack and install it.
>>I realize this may not be all that straightforward for people who are
>>unused to installing open source software.  Once you've done it a couple
>>times though, it gets easier.  Hopefully, we can get another SpamBayes
>>alpha release out in the next little while.  (Tony, if there's anything I
>>can do to help make this happen, let me know.)
>>Once you're ready to go, add the following to your SpamBayes options:
>>    x-lookup_ip: True
>>    lookup_ip_cache: ~/.dnscache
>>
>>    x-image_size: True
>>
>>    x-crack_images: True
>>    crack_image_cache: ~/.image_cache.pickle
>>The first group is unrelated to the image spam, but I find it helps me a
>>lot.  It maps hostnames to their IP addresses using DNS and generates
>>tokens based on those addresses.  The second records tokens about the size
>>of images.  The third enables text extraction from images (OCR, or optical
>>character recognition).  This is where PIL and Ocrad come in.
>>I still get the occasional false negative on image spam, but it's
>>definitely manageable and should improve as Ocrad (itself still a very
>>alpha piece of software) improves.  Even though Ocrad does a poor job of
>>text extraction from a human comprehension standpoint, it generates tokens
>>that SpamBayes just loves and seems to generate enough unique tokens to tip
>>the scales on most image spam.
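
For anyone wondering what "cracking" an image amounts to, here is a rough
sketch of the PIL-plus-Ocrad pipeline skip describes.  It is not the
SpamBayes code: the function names and the "image-text:" prefix are made up
for illustration, and it assumes a modern Python with Pillow installed and
an ocrad binary on your PATH.

```python
import os
import re
import subprocess
import tempfile


def ocr_text_to_tokens(text, prefix="image-text:"):
    """Turn raw OCR output into a set of prefixed, lowercased word tokens.

    The prefix is a hypothetical choice; SpamBayes may name its tokens
    differently.
    """
    words = re.findall(r"[a-z0-9]{3,}", text.lower())
    return {prefix + word for word in words}


def crack_image(path, ocr_command="ocrad"):
    """Convert an image to greyscale PNM with PIL, then run Ocrad on it."""
    # Imported here so the tokenizing helper above works without PIL.
    from PIL import Image

    fd, pnm_path = tempfile.mkstemp(suffix=".pgm")
    os.close(fd)
    try:
        # Ocrad reads PNM files, so convert whatever arrived (GIF, JPEG...).
        Image.open(path).convert("L").save(pnm_path, "PPM")
        result = subprocess.run(
            [ocr_command, pnm_path],
            capture_output=True,
            encoding="latin-1",  # never fails to decode, whatever Ocrad emits
            check=False,
        )
        return ocr_text_to_tokens(result.stdout)
    finally:
        os.remove(pnm_path)
```

The OCR output does not have to be readable.  Misrecognized words such as
"v1agra" still come out as consistent tokens, which is all the classifier
needs.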
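
On Peter's question about the X-Spambayes-Evidence header: one way to check
is to scan the mbox for clues that look image-related.  The sketch below
assumes the header is a ';'-separated list of 'token': probability pairs
and that the image tokens start with "image".  Both are assumptions, so
look at a real header and adjust the prefix before trusting the results.
The function names are made up for illustration.

```python
import mailbox
import re

# Matches one "'token': 0.99" clue in the header value (assumed format).
CLUE_RE = re.compile(r"'((?:[^'\\]|\\.)*)':\s*(\d+(?:\.\d+)?)")


def evidence_clues(msg, header="X-Spambayes-Evidence"):
    """Return [(token, probability), ...] parsed from the evidence header."""
    value = str(msg.get(header, ""))
    value = " ".join(value.split())  # undo header line folding
    return [(token, float(prob)) for token, prob in CLUE_RE.findall(value)]


def image_clues(msg, prefixes=("image",)):
    """Keep only clues whose token starts with one of the given prefixes."""
    return [(t, p) for t, p in evidence_clues(msg) if t.startswith(prefixes)]


def scan_mbox(path):
    """Print the subject and image-related clues of each message that has any."""
    for msg in mailbox.mbox(path):
        hits = image_clues(msg)
        if hits:
            print(msg.get("Subject", "(no subject)"), hits)
```

Calling scan_mbox() with the path to your mbox file lists the messages in
which the new tokens ended up among the clues used for scoring.  A message
with no hits may still have produced image tokens that were not significant
enough to be listed.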