[Spambayes] Analyzing text in image spam
Peter Barker
peterb at zeta.org.au
Mon Aug 21 01:35:23 CEST 2006
I have installed the CVS version as suggested. A couple of points that may
help others trying it, especially with PIL: I am using FC5 on AMD64 and had
to install tk-devel and tcl-devel as well as tk and tcl (and tkinter etc.).
To get PIL to build with support for everything, I had to add /usr/lib64 to
the standard library paths in setup.py. The freetype2 files required by PIL
are in freetype-devel.
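One way to confirm that the PIL build picked up the optional libraries is to try importing its extension modules. This is only a minimal sketch: the module names in PIL_EXTENSIONS are my assumption about how a PIL build names its core, FreeType and Tk extensions, and probe_modules is a hypothetical helper, not part of PIL or SpamBayes.

```python
import importlib

# Assumed names of the compiled extensions (core, FreeType, Tk).
# They may differ between PIL versions; adjust to match your install.
PIL_EXTENSIONS = ["PIL._imaging", "PIL._imagingft", "PIL._imagingtk"]


def probe_modules(names):
    """Return a dict mapping each module name to True if it imports."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            # Module missing, or its shared library failed to load.
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in probe_modules(PIL_EXTENSIONS).items():
        print("%-20s %s" % (name, "ok" if ok else "MISSING"))
```

A MISSING entry would suggest the matching -devel package or library path was not found when PIL was built.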
I will report how it performs in a few days. Is there any way I can easily
test it with my current spam collection without creating a new .hammiedb and
starting again? My email is stored in one file/folder (mbox). I tried just
feeding a few messages which had been incorrectly classified, and they were
now classified as spam, but I think that is because I had trained them as
spam after I received them (with version 1.1a2). I am using kmail with
sb_bnfilter.py. Can I tell from the X-Spambayes-Evidence header if the new
code is detecting any spam?
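To make the header question concrete, here is a rough sketch of how the evidence header in an mbox could be scanned for image-related clues. It assumes the header is a "; "-separated list of quoted tokens, each followed by a probability, and that the image code emits tokens with the prefixes listed in IMAGE_PREFIXES. Both are assumptions on my part, not something I have confirmed in the source. Words recovered by OCR are presumably tokenized like ordinary body text, so they would not be distinguishable this way.

```python
import mailbox
import re

# Assumed clue format: 'token': 0.97; "other'token": 0.03; ...
CLUE_RE = re.compile(r"""(['"])(.*?)\1: (\d+\.\d+)""")

# Assumed prefixes of the synthetic tokens generated by the image code.
IMAGE_PREFIXES = ("image-size:", "image-text:", "invalid-image:")


def parse_evidence(header_value):
    """Split an evidence header into (token, probability) pairs."""
    text = " ".join(header_value.split())  # undo header line folding
    return [(tok, float(prob)) for _q, tok, prob in CLUE_RE.findall(text)]


def image_clues(header_value, prefixes=IMAGE_PREFIXES):
    """Keep only the clues whose token starts with an image prefix."""
    return [(tok, prob) for tok, prob in parse_evidence(header_value)
            if tok.startswith(prefixes)]


def scan_mbox(path):
    """Print every message in an mbox that carries image clues."""
    for msg in mailbox.mbox(path):
        evidence = msg.get("X-Spambayes-Evidence")
        if not evidence:
            continue
        clues = image_clues(str(evidence))
        if clues:
            print(msg.get("Subject", "(no subject)"), clues)
```

Calling scan_mbox() on a folder that has been run through the filter would list the messages where such tokens contributed to the score.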
Regards,
Peter Barker
> >>>>> "skip" == skip <skip at pobox.com> writes:
>
> I should have given a bit more complete answer based on your message's more
> general point. I recently added a fair amount of code to SpamBayes to
> "crack" the content of images. The new code works very well for me. If
> you'd like to try it, here's what you'll need to do:
>
> 1. Check out the latest source from the CVS repository. (There's been
> no new release since my recent checkins.) Install it.
>
> 2. Install the Python Imaging Library:
> http://www.pythonware.com/products/pil/
>
> 3a. (Windows) Grab the ocrad-cygwin package from the
> SpamBayes Files page:
> http://sourceforge.net/project/showfiles.php?group_id=61702
> Unpack the zip file and copy ocrad.exe somewhere on your PATH.
>
> 3b. (Unix/Linux/Mac) Grab the ocrad source distribution from its web
> site:
> http://www.gnu.org/software/ocrad/ocrad.html
> Unpack and install it.
>
> I realize this may not be all that straightforward for people who are
> unused to installing open source software. Once you've done it a couple
> times though, it gets easier. Hopefully, we can get another SpamBayes
> alpha release out in the next little while. (Tony, if there's anything I
> can do to help make this happen, let me know.)
>
> Once you're ready to go, add the following to your SpamBayes options:
>
> x-lookup_ip: True
> lookup_ip_cache: ~/.dnscache
>
> x-image_size: True
>
> x-crack_images: True
> crack_image_cache: ~/.image_cache.pickle
>
> The first group is unrelated to the image spam, but I find it helps me a
> lot. It maps hostnames to their IP addresses using DNS and generates
> tokens based on those addresses. The second records tokens about the size
> of images. The third enables text extraction from images (OCR, or optical
> character recognition). This is where PIL and Ocrad come in.
>
> I still get the occasional false negative on image spam, but it's
> definitely manageable and should improve as Ocrad (itself still a very
> alpha piece of software) improves. Even though Ocrad does a poor job of
> text extraction from a human comprehension standpoint, it generates tokens
> that SpamBayes just loves and seems to generate enough unique tokens to tip
> the scales on most image spam.
>
> Skip
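As a rough illustration of the first two option groups Skip describes (tokens derived from resolved IP addresses, and tokens about image size), the sketch below shows one way such tokens could be built. The token spellings ("url-ip:.../8" and "image-size:2**N") and all function names are assumptions for illustration and may not match what SpamBayes actually emits. The standard-library resolver stands in for whatever DNS cache SpamBayes uses.

```python
import math
import socket


def resolve(hostname):
    """Return the IPv4 addresses for a hostname (empty list on failure)."""
    try:
        return socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return []


def ip_tokens(ip, prefix="url-ip"):
    """Build one token per network prefix length: /8, /16, /24 and /32."""
    quads = ip.split(".")
    if len(quads) != 4:
        raise ValueError("expected a dotted-quad IPv4 address: %r" % ip)
    return ["%s:%s/%d" % (prefix, ".".join(quads[:n]), 8 * n)
            for n in (1, 2, 3, 4)]


def image_size_token(nbytes):
    """Bucket an image's byte size by the nearest power of two."""
    if nbytes <= 0:
        raise ValueError("image size must be positive")
    return "image-size:2**%d" % round(math.log2(nbytes))
```

Bucketing by prefix length lets the classifier learn about whole networks as well as single hosts, and bucketing sizes by powers of two keeps the number of distinct size tokens small.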